Chapter 3

Statistical measures

For a good description of collected data material, we also need statistical measures (or statistics) in addition to tabular and graphical representations. With these statistics, the distribution of the data, which was graphically visualised in chapter 2, can be summarised in a compact and comprehensible way by few numbers. When dealing with quantitative data, we primarily distinguish between location measures and dispersion measures.

Educational objectives

After having worked through this chapter you will be familiar with different statistical measures (statistics) and their meaning, in particular you will know the difference between location measures and dispersion measures. Furthermore you can calculate statistics yourself and you can also construct and interpret a boxplot.

Key words: mean value, median, quartiles, quantiles, variance, standard deviation, range, interquartile range.

Previous knowledge: type of variable (chap. 1), boxplot (chap. 2)

In the following table Dr. Frank N. Stein has listed the age of the \(18\) female chemical company employees. The \(18\) values of age are from now on represented by \(x_1, x_2, . . . , x_{18}\). Hence e. g. \(x_5 = 35\).

Table 3.1: Height of the \(18\) female chemical company employees
Patient Age (yr) Patient Age (yr)
1 33 10 47
2 41 11 51
3 58 12 56
4 40 13 32
5 35 14 37
6 41 15 37
7 43 16 44
8 47 17 65
9 36 18 52

These data are visualised in the following boxplot.

Figure 3.1. Boxplot of the age of the \(18\) female employees

Central questions:

    (i) How can these data be summarised with few informative statistics (measures)?
    (ii) How is the boxplot constructed from the data in table 3.1?

Figure 3.2. Scheme of chapter 3

3.1. Measures of Location

Question: How old are Dr. Frank N. Stein's female chemical company employees?

This question can be answered by examining the individual age values. However, the table gives too many details. The goal is to get a summary of the age values using a measure of location.

Measures of location provide information on the magnitude of the collected data, i.e., on their location on the underlying numerical scale. The goal is to calculate a number which represents the data well. A measure of location gives an answer to the question "Where are the data located?". This question can be specified in different ways: `

    Where is the middle part of the data located?
    Where are the smallest and the largest data located?

According to the various possible questions, there are also various measures of location.

3.1.1. Mean value

Dr. Frank N. Stein would like to find out the mean age of his female chemical company employees. To do so, he has to sum up all the age values of the \(18\) women and divide the sum by \(18\). Therefore the mean age is \[ \bar{x} = \frac{x_1 + x_2 + ... + x_{18}}{ 18} = \frac{795}{ 18} = 44.17 \, \text{years} \] The mean value is also called "arithmetic mean", "average value" or just "mean". It is usually denoted by \(\bar{x}\) (read: x bar). if the individual observations are denoted by \( x_{1}, x_{2}, ... , x_{18} \). (If the individual observations were denoted by \( y_{1}, y_{2}, ..., y_{18} \), the mean would be \(\bar{y} \).) The value of \(\bar{x} = 44.17\) is the mean age of the \(18\) women. At first sight, it may be surprising that none of the women is exactly \(44.17\) years old. But this is the normal situation with mean values.

You have certainly calculated mean values before. However, mean values are only meaningful under certain conditions.

Synopsis 3.1.1

A meaningful calculation of the mean value is only possible with quantitative variables, because the calculation of sums makes sense only for quantitative data.

In this context, take a look at the following question:

Question Nr. 3.1.1 What does the statement "Only the values of quantitative variables can be added" imply for the variable ALCOHOL in Dr. Frank N. Stein's data set?
Note: In the data set, 1 indicates daily alcohol consumption, 2 weekly alcohol consumption and 3 that alcohol is consumed seldom or never.
If we calculate the mean value with these values, we get a mean alcohol consumption of 1.6. Tick the correct statement(s):
Since the mean value of 1.6 lies between 1 and 2, we can say that daily and weekly alcohol consumption are common. Moreover, since the mean lies closer to 2, the answer `weekly alcohol consumption' is the most frequent.
Numbers may always be added. The mean can therefore be calculated without any problems.
The coding values 1, 2 and 3 stand for `daily', `weekly' and `seldom/never'. However, we are talking about ordinal data. To calculate a mean value would thus not make sense.
A mean value of 1.6 cannot be interpreted meaningfully.
The mean can be calculated in this way, because the variable ALCOHOL is quantitative.

3.1.2. Median

The mean value is a measure of the mean location of the data. Another measure of the mean location is the "median". The idea of "mean" is conceived differently in this case. The median can be derived according to the following rule:

Find a number such that fifty per cent of the data are smaller and fifty percent larger than this number

However, such a number can only exist if the sample size is even (only an even number can be divided by \(2\) !). If the sample size is uneven, then we try to approximately fulfill this condition. In this case, the following rule is applied:

We first arrange the individual values in ascending order. Then the value which lies exactly in the middle of the ordered series is identified by counting.

Question Nr. 3.1.2 Which is the median? The body weight of 7 men is given in kilograms: 69, 80, 56, 63, 67, 79, 71.
Find the median.
69 kg
73 kg
70 kg
67 kg
71 kg

What will however happen if the sample size is even, as in Dr. Frank N. Stein's 18 female patients? In this case, it is not possible to directly define the median by counting. This problem can be solved as follows:

We equate the median with the arithmetic mean of the two values lying in the center of the ordered series.

Dr. Frank N. Stein can now calculate the median of the age of his \(18\) female patients by considering the two age values which lie in the center of the ordered series \[ 32, 33, 35, 36, 37, 37, 40, 41, \boldsymbol{41}, \boldsymbol{43}, 44, 47, 47, 51, 52, 56, 58, 65 \] i.e. the \(9\)-th and the \(10\)-th value, and then calculating the arithmetic mean of these two values (i.e., \(41\) and \(43\)). In this sample, the median of age is thus \[ \tilde{x} = \frac{41 + 43}{2} = 42 \, \text{years}. \]

The median is usually denoted by \( \tilde{x} \) (read: "x tilde").

Question Nr. 3.1.3 Which is the median in this case? The body weight of 8 men is given in kilograms: 69, 80, 56, 63, 73, 67, 79, 71.
Determine the median.
69 kg
73 kg
70 kg
67 kg
71 kg

This method of determining the median corresponds to its reading from the distribution function, as it was explained in chapter 2.

Figure 3.3. Median for uneven-numbered (left) and even-numbered (right) sample sizes

In order to determine the median of the data underlying the empirical distribution function, we read the value of age for which the distribution function has a cumulative relative frequency of \(0.5\).

If we have an uneven sample size, as in the left part of figure 3.3, then \(0.5\) marks the midpoint of a step. In this case, the median can be directly read as the x-value of this step.

If we have an even sample size, as in the right part of figure 3.3, then the step function runs horizontally at the level \(0.5\). Therefore any value of age within this horizontal line segment is a potential candidate for the median. It is thus natural to take the value of weight, which cuts this horizonal line segment in half (i.e., the arithmetic mean of the two age values limiting the segment). Since we have to calculate an arithmetic mean when determining the median of even-numbered samples, a meaningful interpretation of the median is not possible for non-quantitative variables either.

Synopsis 3.1.2

Equal numbers of values lie on both sides of the median. If the sample size n is even, then this can be sharpened to \(50\%\) of the data lying on either side of the median.

3.1.3 Mean value or median?

The mean value and the median are both statistical measures for the mean location or the centre of the data. Therefore, the following questions arise:

    Which measure should preferably be used when?
    How can differences between the two measures be interpreted?

In order to answer these questions, we first have to examine how strongly the mean value and the median are influenced by extreme values. Work through the following questions

Question Nr. 3.1.4.a.1 The following observations represent blood sugar values in mg/dl of n different female patients: 77, 79, 80, 82, 83, 86, 87, 89, 95, n = 9
Determine the mean and the median.
Mean = 84.2, Median = 84
Mean = 84.2, Median = 83
Mean = 83, Median = 84
Mean = 84, Median = 83

Question Nr. 3.1.4.a.2 The following observations represent blood sugar values in mg/dl of n different female patients: 77, 79, 80, 82, 83, 85, 86, 87, 89, 95, n =10
Determine the mean and the median.
Mean = 84.8, Median = 85
Mean = 83, Median = 85
Mean = 84.3, Meddian = 84
Mean = 83, Median = 84

Question Nr. 3.1.4.a.3 The following observations represent blood sugar values in mg/dl of n different female patients: 71, 77, 79, 80, 82, 83, 86, 87, 89, n=9
Determine the mean and the median.
Mean = 81.6, Median = 82
Mean = 81.6, Median = 83
Mean = 81, Median = 83
Mean = 82, Median = 83

Question Nr. 3.1.4.a.4 The following observations represent blood sugar values in mg/dl of n different female patients: 77, 77, 79, 80, 82, 83, 86, 89, 190, n=9
Determine the mean and the median.
Mean = 94, Median = 81
Mean = 93.7, Median = 82
Mean = 93.7, Median = 82.5
Mean = 83.7, Median = 82

Question Nr. 3.1.4.b How are the mean value and the median influenced by extreme (especially large or especially small) observations? .
Both measures are sensitive to such observations.
The mean is influenced by such observations, but not the median.
If an outlying value turns out to be wrong and is corrected, then the mean value changes but not the median.
If an outlying value turns out to be wrong and is corrected, then the mean value changes.

Your findings are summarized in the following synopsis

Synopsis 3.1.3

The mean value responds very strongly to changes in the largest and smallest values of the sample, while the median is not affected by such changes. In the language of statistics, this property is referred to as "robustness" of the median.

You can convince yourself of the robustness of the median by opening the applet "Comparison of mean value and median".

Figure 3.4. Applet "Comparison of mean value and median"

Answer the following question with the help of this applet.

Question Nr. 3.1.5 What happens to the median and the mean? Move the smallest of the given values to the lower end of the scale (make it as small as possible). What happens to the median and the mean?
The mean and median become smaller.
The mean becomes smaller, the median becomes bigger.
The mean becomes bigger, the median becomes smaller.
The mean and the median stay the same.
The mean becomes smaller, the median stays the same.

You have seen that individual extreme values can strongly influence the mean. If the difference between the median and the mean is significant, this either indicates the presenece of extreme values or that the data have a skewed distribution. In such situations, the median is the more adequate measure of location. A distribution is said to be "skewed", if the spread of the values is different to the left and to the right of the median.

The decision of whether to calculate the mean or the median also depends on the underlying question. For instance, if we want to estimate the annual number of hospitalisation days from a sample of stationary stays in a county hospital, then the mean length of the hospital stays of this sample can be multiplied by the annual number of hospital stays. In this case, the mean is the right choice. However, if we are interested in a typical value, the median is more appropriate.

Question Nr. 3.1.6 Which of the following statements are incorrect?
The mean can be calculated for ordinal data.
The median can be calculated for nominal data.
The median can only be calculated for an odd (uneven) number of values.
The median cannot be calculated for continuous variables like body height or body weight.

3.1.4 Minumum and Maximum

As Dr. Frank N. Stein is also interested in the range of age of his female patients, he wants to determine the minimum and the maximum of their ages. He will find that the smallest age value in his data set is \(x_{[1]} = 32\) years, and that the largest one is \( x_{[18]} = 65\) years. In general, the \(k\)-th value of a series of quantitative data arranged in ascending order is denoted by \( x_{[k]} \). Thus, \( x_{[1]} \) is the smallest and \( x_{[n]} \) the largest of the \(n\) values. Determining the minimum and the maximum may potentially even make sense for ordinal data.

The term "robustness" was already introduced before. A robust measure is not influenced at all or only very slightly by individual very big or very small values (i.e., so-called "outliers"). The minimum and the maximum are thus everything but robust.

Question Nr. 3.1.7 How robust are maximum and minimum? The term `robustness' was already introduced along with the median. A robust measure of location is not influenced at all or only very slightly by individual very big or very small values (outliers). How robust are maximum and minimum?
Maximum and minimum are similarly robust as the median.
Maximum and minimum are not robust, i.e., they react very sensitively to outliers.
Maximum and minimum are very robust, i.e., they do not react to outliers.

3.1.5. Quartiles

We can say that the median cuts the ordered sample in half (i.e. with \(50\%\) of the data being larger and \(50\%\) being smaller than the median).

If we want to cut the ordered sample into four equal parts, we must additionally introduce the lower and the upper quartile. Ideally, the lower quartile cuts the ordered data into a lower part containing \(25\%\) of the data and an upper part containing \(75\%\) pf the data. Likewise, the upper quartile ideally cuts the ordered data into a lower part containing \(75\%\) of the data and an upper part containing \(25\%\) of the data. However, these )conditions can only be fulfilled, if \(n\) can be divided by \(4\), i.e., if the multiplication of the sample size \(n\) with \(0.25\) (or \(0.75\)) provides an integer number. If this is not the case, we have to find numbers, which approximately fulfill the conditions.

In order to calculate the lower and upper quartile, the data have to be arranged in increasing order. The ordered values are denoted by \( x_{[1]}, x_{[2]}, ..., x_{[n]}\). We thus have \[ x_{[1]} \leq x_{[2]} \leq ... \leq x_{[n]} \] To determine the lower quartile, the sample size \(n\) is multiplied by \(0.25\).

If \(0.25 \times n\) is not an integer number, we round \(0.25 \times n\) up to the next integer value. This number is denoted by \( \lceil 0.25 \times n \rceil \). The lower quartile is then the \( \lceil 0.25 \times n \rceil \)-th value of the ordered series, i.e. \( Q1 = x_{[ \lceil 0.25 \times n \rceil]} \) .

However, if \(0.25 \times n\) is an integer number, the lower quartile is defined as the mean value of the observation \(x_{[0.25 \times n]}\) and the observation \( x_{[0.25 \times n + 1]} \), i.e. \( Q1 = 0.5 \times (x_{[0.25 \times n]} + x_{[0.25 \times n+1]}) \).

The upper quartile is determined in an analogous way, with the only difference that the sample size \(n\) is multiplied by \(0.75\) instead of \(0.25\).

The two quartiles of the age of Dr. Frank N. Stein's \(18\) female patients are determined as follows: At first, the values \(18\times0.25 = 4.5\) and \(18\times0.75 = 13.5\) are calculated. Since both values are fractions, they are rounded up to the next integer: \( \lceil 4.5 \rceil = 5 \) and \( \lceil 13.5 \rceil = 14 \). Thus, the lower quartile is the fifth and the upper quartile the fourteenth value of the ordered data series, i.e., \( Q1 = x_{[5]} = 37 \) years and \( Q3 = x_{[14]} = 51 \) years.

Without the oldest and the youngest female patient, the sample size would be \(16\). The products \(16 \times 0.25 = 4\) and \(16 \times 0.75 = 12\) would then both be integer numbers. Accordingly, the first quartile would be equal to the mean value of the fourth and the fifth observation of the reduced data series, i.e., \((37 + 37)/2 = 37\), and the third quartile would be equal to the mean value of the twelfth and the thirteenth observation of this series, i.e., \((47 + 51)/2 = 49\).

Question Nr. 3.1.8 From a small patient survey, the following 8 answers on the weekly number of days with meat consumption are available: 2, 4, 6, 5, 7, 6, 4, 3.
Which values represent the upper and the lower quartile of these data?
the lower quartile is 2, the upper quartile is 7
the lower quartile is 3, the upper quartile is 5
the lower quartile is 3.5, the upper quartile is 6
the lower quartile is 2, the upper quartile is 5

The calculation of quartiles only makes sense for quantitative variables.

Synopsis 3.1.4

If \(0.25 \times n\) is an integer number, then \(25\%\) of the data lie below and \(75\%\) above the first quartile. Likewise, \(75\%\) of the date lie below and \(25\%\) above the third quartile. If \(0.25 × n\) is a fraction, then these statements hold approximately.

3.1.6. Percentiles and Quantiles

The method of determining quartiles can be generlised to any percentile or quantile. As a guide you can memorise that the first quartile corresponds to the \(25\)-th percentile or \(0.25\)-quantile, the median to the \(50\)-th percentile or \(0.5\)-quantile and the upper quartile to the \(75\)-th percentile or \(0.75\)-quantile of the data.

Let \(p\) be a number between \(0\) and \( 1\). The \(p\)-quantile or \((100 \times p)\)-th percentile of a set of numerical data is determined in a similar way as the lower and the upper quartile, just with \(p\) replacing \(0.25\) or \(0.75\). We thus multiply the sample size \(n\) by \(p\).

If the result is a fraction, then we round \(n \times p\) up to the next integer \( \lceil n \times p \rceil \) and determine the \( \lceil n \times p \rceil \)-th value in the ascendingly ordered data set.

If the result is an integer, then we take the mean of the \((n \times p)\)-th and the \((n \times p + 1)\)-th value.

If the data set is large, then about \((100 \times p) \%\) of the data will be smaller and about \((100 \times (1-p)) \%\) of the data will be larger than this value. Notice that the \((100 \times p)\)-th percentile and the \(p\)-quantile are synonyms. Common percentiles beside the median and the quartiles are defined by \(p = 0.1, 0.2, . . . , 0.9\) or also by \(p = 0.05, 0.95\).

Dr. Frank N. Stein would like to know which age is not exceeded by \(90\%\) of his \(18\) female patients. With \(18 \times 0.9 = 16.2\) the \(90\)-th percentile turns out to be the \(17\)-th value of the ordered data series, i.e., \(58\). Hence the \(90\)th percentile in his sample equals \(58\) years.

To be noted: There are alternative methods to calculate percentiles. This is why different statistics programs can provide slightly different results.

What has already been said for the median before, of course also applies to percentiles: their calculation only makes sense for quantitative variables

Synopsis 3.1.5

If \(n \times p\) is an integer number, then \(100 \times p \%\) of the data lie below and \(100 \times (1-p) \%\) above the \(100 \times p\)-th percentile or \(p\)-quantile. If \(n \times p\) is a fraction, then this statement holds approximately.

3.2. Measures of dispersion

The measures of location provide information on certain characteristics of the data set. However, a measure of location by itself is not sufficient for a meaningful summary of the data. Only the combination with a measure of dispersion can provide a comprehensive description of the data, as it is illustrated in the following example:

In the context of a stop-and-search operation by the police, car drivers are tested for their blood alcohol level directly on the road using a rapid test. If this test shows an elevated level of alcohol, the car driver has to undergo a blood test in the laboratory.

A person is now tested five times with a rapid test and five times with a blood test. The following values expressed in per mille resulted from the two tests.

    Rapid test: \(0.67, 1.17, 0.98, 0.90, 0.78\)
    Blood test: \(0.87, 0.90, 0.92, 0.91, 0.90\)

Exercise:

    a) Calculate the respective mean and the median of the two series of measurements.
    b) What do you notice?

The mean and the median are \(0.9\) per mille in both cases. Thus, the two measures of location coincide. However, if we take a look at the original data, we notice that the results of the two tests are quite different. The values of the rapid test are much more scattered than those of the blood test. This fact can be captured with a measure of dispersion.

Synopsis 3.2.1

The distribution of quantitative data can be well described with a measure of location and a measure of dispersion.

3.2.1. Variance and standard deviation

The two most important measures of dispersion are the "standard deviation" and the "variance". The standard deviation measures the deviation of the individual values from the mean. Computationally, however, the variance is its precursor.

If a policeman wants to calculate the variance of the values from the rapid test, he has to square the differences of the individual observations from the mean of the sample and then sum up the squared differences. The sum of the squared differences is then divided by the number of observations \(n\) minus \(1\), hence by \(n-1\). The mathematical formula of the variance is \[ \text{variance} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + ... + (x_n - \bar{x})^2}{n - 1} \] The differences are squared in order to only get positive summands. When just summing up the differences as they are, the negative and positive deviations from the mean cancel each other and the sum becomes \(0\).

Why do we divide the sum of squared differences by \(n-1\) and not by \(n\)?

For a simple explanation let us consider a sample of size \(1\) whose only value is \(x_1\). Then, \( \bar{x} = x_1 \) and thus \( x_1 - \bar{x} = 0 . \) If we divided the square of this difference by \(n = 1\), we would obtain \[ \text{variance} = \frac{(x_1 - \bar{x})^2 }{1} = \frac{0}{1} = 0. \] However, if we divide by \(n-1 = 0\), then we obtain \[ \text{variance} = \frac{(x_1 - \bar{x})^2 }{0} = \frac{0}{0}, \] which is undefined. This is the only meaningful result for the variance in a situation with one single observation, where we do not have any information on the dispersion of the respective variable.

In order to calculate the variance it is convenient to generate a table as follows :

Table 3.2: Calculation of the variance of the rapid test results
Values of the rapid test Differences to the mean (=0.90) Squared differences
0.67 -0.23 0.0529
1.17 0.27 0.0729
0.98 0.08 0.0064
0.9 0.00 0.0000
0.78 -0.12 0.0144
sum 0.00 0.1466

The variance of the measurements of the rapid test thus equals \(0.1466/4 = 0.037\) per \(10^6\). With the help of an analogous table, the laboratory technician finds that the variance of the blood test measurements equals \(0.0004\) per \(10^6\).

After having computed the variance, the standard deviation is obtained by taking the square root of the variance \[\text{standard deviation} = \sqrt{\text{variance}} \]

We obtain a standard deviation of \(0.19\) per mille for the rapid test measurements and a standard deviation of \(0.02\) per mille for the blood test measurements.

By taking the square root of the variance, we get back to the original units (i.e., per mille).

The two standard deviations show that the values of the blood test scatter much less around the mean than those of the rapid test. Hence the blood test seems to provide more accurate and reproducible measurements than the rapid test. Since the mean and the standard deviation always have the same unit (per mille in our case), the standard deviation plays a more important role in descriptive statistics than the variance. However, the variance will play an important role in some later chapters.

With analogous calculations, Dr. Frank N. Stein can determine the variance of the age of his female chemical company employees. Its value equals \(85.56\) years\(^2\), providing a standard deviation of \(9.25\) years.

The following table gives the values of height of the \(18\) female patients of Dr. Stein

Table 3.3: Height of the \(18\) female patients
Patient Height (cm) Patient Height (cm)
1 152 10 163
2 152 11 164
3 153 12 165
4 155 13 167
5 160 14 167
6 160 15 169
7 162 16 170
8 162 17 171
9 163 18 172

Question Nr. 3.2.1 Which are the values of the variance and the standard deviation of the body height of the 18 women (cf. Table 3.3.)?
Hint: The mean value is 162.61 cm. Use your calculator to answer this question!
38.35 cm2 and 6.19 cm
502405.11 cm2 and 708.81 cm
0 cm2 and 0 cm
40.6 cm2 and 6.37 cm
32.7 cm2 and 5.72 cm

3.2.2. Range and interquartile range

If Dr. Frank N. Stein wants to find the difference in age between the youngest and the oldest woman in his sample, he must calculate the difference between the maximum and the minimum. This difference is called "range". The range of his female patients' age is \(65 - 32 = 33\) years.

Another measure of dispersion is the "interquartile range". This is the difference between the upper and the lower quartile. As the upper quartile of age equals \(51\) years and the lower quartile \(37\) years in the data set of the \(18\) female patients, the interquartile range equals \(51 - 37 = 14\) years.

The interquartile range measures the spread of the central \(50\%\) of the data. It is a more robust measure of dispersion than the standard deviation, as it is not influenced by very big or very small values. The range is the least robust measure of dispersion, as it only depends on the minimum and the maximum.

The interquartile range and the range are both represented in the boxplot. The interquartile range coicnides with the length of the box, while the range equals the length of the entire plot.

Synopsis 3.2.2

The range measures the overall spread (width) of the data. The interquartile range measures the spread (width) of the central fifty percent of the data.

Table 3.4: Summary of the statistics of age of Dr. Frank N. Stein's \(18\) female patients
Name Symbol Value
measures of location
mean \( \bar{x} \) 44.17 years
minimum \( x_{[1]} \) 32 years
lower quartile \( Q_1 \) 37 years
median \( \tilde{x} \) 42 years
upper quartile \( Q_3 \) 51 years
maximum \( x_{[18]} \) 65 years
measures of dispersion
variance \( s^2 \) 85.56 years\(^2\)
standard deviation s 9.25 years
range \( x_{[18]} - x_{[1]} \) 33 years
interquartile range IQR = \( Q_3 - Q_1 \) 14 years

The following questions concern table 3.3.

Question Nr. 3.2.2 Which is the smallest observed value of body height (minimum)?
152 cm
152.5 cm
151 cm

Question Nr. 3.2.3 Which is the lower quartile of the observed values of body height?
155 cm
165 cm
160 cm

Question Nr. 3.2.4 Which is the median of the observed values of body height?
162.61 cm
162 cm
163 cm

Question Nr. 3.2.5 Which is the upper quartile of the observed values of body height?
169 cm
167 cm
155 cm

Question Nr. 3.2.6 Which is the maximum of the observed values of body height?
172 cm
173 cm
152 cm

Question Nr. 3.2.7 Which is the mean of the observed values of body height?
163 cm
163.5 cm
162.6 cm

Question Nr. 3.2.8 Which is the range of the observed values of body height?
2 cm
20 cm
7 cm

Question Nr. 3.2.9 Which is the interquartile range of the observed values of body height?
20 cm
3 cm
7 cm

3.3. Boxplots and statistical measures

We can get a good overview of the distribution of sampled data by means of a boxplot, in which the following measures of location are represented: minimum, lower quartile, median, upper quartile and maximum. The boxplot was already treated in chapter 2. In the following figure, the age of the \(18\) female employees is represented in a boxplot

Figure 3.5. Boxplot of the \(18\) female employees' age

However, we cannot only read the five measures of location from this graph, but also the range and the interquartile range. The range is the distance between the two ends of the plot and the interquartile range is the length of the box.

With the help of a boxplot, it can also be assessed whether the data are distributed symmetrically or not. If the vertical line, which represents the median, divides the box into equal parts, then this is an indication for a symmetrical distribution. If the left part of the box is shorter than the right part, this indicates that the data are more widely scattered above the median than below it. In this case, we speak of a "right-skewed" distribution. In the converse situation (i.e. if the left part of the box is wider than the right part), we speak of a "left-skewed" distribution.

The following boxplot represents the body height of Dr. Frank N. Stein's male chemical employees.

Figure 3.6. Boxplot of the \(103\) male employees' body height

Question Nr. 3.3.1 The boxplot represents the body height of Dr. Frank N. Stein's male chemical company employees. Which statements are correct for this boxplot?
The represented measures are 159, 170, 175, 179 and 193 cm.
More than 50% of the observations have a value which is bigger than 173 cm.
About 50% of the observations lie between 170 and 179 cm.
The data are distributed symmetrically, i.e., the data are scattered equally to the right and the left of the median.
The data distribution is slightly right-skewed, i.e., the scatter of the data is wider above than below the median.

3.4. Summary

Statistical measures (statistics) can be divided into "measures of location" and "measures of dispersion". Measures of location provide information about the location of the data.

In order to describe the centre of the data, the mean value or the median are used. If the distribution of the data is skewed, the median is generally preferred.

More general measures of location are the percentiles. The \((100 \times p)\)-th percentile divides the data into the \((100 \times p)\%\) smallest and the \((100 \times (1-p))\%\) largest values. Special cases of percentiles are the median (\(50\)-th percentile), the lower quartile (\(25\)-th percentile) and the upper quartile (\(75\)-th percentile).

Measures of dispersion describe the scatter of the data. The scatter of the data around the mean value is measured by the "standard deviation". The wider the scatter of the data around the mean value, the larger the standard deviation. Other measures of dispersion are the "interquartile range", which measures the spread of the central fifty per cent of the data and the range, which measures the spread of the whole sample.

In a boxplot, the following seven statistical measures represented: minimum, lower quartile (or \(25\)-th percentile), median, upper quartile (or \(75\)-th percentile) and maximum (measures of location), range and interquartile range (measures of dispersion). Occasionally, the mean is additionally represented as a dot.

All of these measures are only meaningful for quantitative variables. When dealing with qualitative variables, the relevant statistical measures are the relative frequencies of the different values (or categories) of the respective variables.

Figure 3.7. Scheme of chapter 3 with gaps