Chapter 2 Description of data

This chapter introduces tools with which we can clearly illustrate how often the values of a variable occur in a data set, i.e., how the data are distributed.

Educational objectives

In this chapter you will be introduced to important means of data visualisation such as the bar chart, the histogram and the scatter plot. After having worked through this chapter you will know which illustration method to choose for which type of data, and you will be able to interpret these graphical representations.

Key words: frequency table, absolute frequency, relative frequency, bar chart, pie chart, histogram, empirical distribution function, boxplot, scatter plot.

Previous knowledge: variable, type of variable, value

Central questions: How can data be represented and illustrated in a clear way? Which illustration methods are adequate for each particular data type?

2.1. At a glance

The data set of Dr. Frank N. Stein includes the data of \(121\) chemical company employees. To get an overview of these data and to process them, he has to illustrate them in a meaningful way. But how?

If he shows a table with numerical values of individual patients at a presentation, neither he himself nor the audience will understand the content of these numbers.

This is why the data should be summarised in a frequency table or illustrated in a figure. With a figure one can understand important characteristics of the data at a glance. For instance, we can see right away in which range the data lie.

Of course it is important to find an adequate illustration for the data at hand. You will learn about this in the following sections.

2.2. Summarising data in a frequency table

In the patient file, the patients' gender was recorded with the values \(1\)="male" and \(2\)="female". We would now like to know the number of women and men among the chemical company employees attended by Dr. Frank N. Stein. To do so, we count how often each value or category appears. The two resulting numbers, also called "absolute frequencies", are represented in a table.

Discrete variables, which do not have too many values, can be illustrated in an easy and compact way like this. First of all the absolute frequencies of the different values have to be determined. Then we can also calculate the so-called "relative frequencies". We get these by dividing the absolute frequencies by the sample size. The sum of relative frequencies of all values is therefore always \(1\). Relative frequencies are often also expressed as percentages. For this, we have to multiply the original values of the relative frequencies by \(100\). For instance, a relative frequency of \(0.2\) corresponds to \(20\%\). Another important term to be introduced is the "frequency distribution" of a variable.

Definition 2.1

The pattern of relative frequencies of a variable is referred to as frequency distribution.

For Dr. Frank N. Stein's sample, the frequency table of the variable gender looks as follows:

Table 2.1: Frequency table of gender
Category	Absolute frequency	Relative frequency	Percentage
male	103	0.851	85.1
female	18	0.149	14.9
Total	121	1.0	100

From this table we can read that Dr. Frank N. Stein examined \(103\) men and \(18\) women, which gives a total of \(121\) examined employees. The relative frequency of men is therefore \(103/121 = 0.851\) and that of women is \(18/121 = 0.149\). We can hence say that about \(85\%\) of his patients are men and about \(15\%\) are women.
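The counting and dividing described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `frequency_table` is our own choice, and the gender column is rebuilt from the counts in Table 2.1 rather than read from the actual patient file.

```python
from collections import Counter

def frequency_table(values):
    """Return {category: (absolute, relative, percentage)} for a list of values."""
    counts = Counter(values)
    n = len(values)  # sample size
    return {
        category: (count, count / n, 100 * count / n)
        for category, count in counts.items()
    }

# Rebuild the gender column from the counts in Table 2.1 (1 = male, 2 = female).
gender = [1] * 103 + [2] * 18
labels = {1: "male", 2: "female"}

table = frequency_table(gender)
for code, (absolute, relative, percent) in table.items():
    print(f"{labels[code]:<8}{absolute:>5}{relative:>8.3f}{percent:>7.1f}")
```

The relative frequencies printed by this sketch add up to \(1\), as stated above.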

The purpose of a "frequency table" is to summarise and illustrate data in a concise way. However, this is only possible if the examined variable does not take too many values. For variables with a large number of values, the table becomes unreadable. Such data should preferably be represented graphically, since we can better capture and process complex information visually.

Synopsis 2.2

A frequency table is useful to summarise discrete data with few values.

Question Nr. 2.2.1 For which of the variables of Dr. Frank N. Stein's data set does it make sense to create a frequency table?
	gender (male, female)
	blood sugar (in mg/dl)
	cholesterol (in mg/dl)
	systolic blood pressure (in mmHg)
	diastolic blood pressure (in mmHg)
	health condition (good, moderate, bad)
	appetite (increased, normal, reduced)
	hypertension (no, yes, questionable, not relevant or not known)
	history of myocardial infarction (no, yes, questionable, not relevant or not known for sure)
	diabetes (no, yes, questionable, not relevant or not known for sure)
	alcohol consumption (daily, weekly, seldom/never)
	skin affections (no, yes, questionable)
	intake of cardiac medication (no, yes, at times, not relevant)
	use of tranquilizers (no, yes, at times, not relevant)
	height (in cm)
	body weight (in kg)
	hallux valgus (no, left, right, both sides, not relevant)
	resting ECG (normal, pathological, questionable, not recorded, conduction disturbances, not evaluable)
	evaluation of the resting ECG (not evaluable, normal, definitely ischaemic, potentially ischaemic, other anomalies)
	haemoglobin (in g/dl)
	number of leukocytes (per μl)
	liver function parameter GPT (in U/l)
	creatinine (in mg/dl)
	age (in years)

2.3. Summarising data graphically

2.3.1. Bar charts

We now also want to describe the variable gender graphically, in order to get a visual impression of the frequency of the values, i.e., of their distribution.

We use the so-called "bar chart" to graphically represent discrete data, no matter if qualitative or quantitative.

Figure 2.2. Bar chart of the gender of the \(121\) patients

A bar chart consists of bars of equal width centering around the respective values without touching each other. Most often, the height of the bars represents the absolute frequency of the values. Alternatively, they can also represent the relative frequencies. This has no effect on the visual impact. In the bar chart above, the absolute frequencies are represented. With each bar chart, the total number of the underlying observations should be indicated.

You may now open the applet "Bar chart" and draw bar charts for different variables of Dr. Frank N. Stein's data set.

The following variables are included in this applet:

variable name	values of the variable
sex	male, female
age	in years
body height	in cm
body weight	in kg
general well-being	good, moderate, bad
appetite	reduced, normal, increased
alcohol (alcohol consumption)	daily, weekly, rarely/never
myocardial infarction	yes, no, unclear, missing

For which of these variables is the distribution appropriately represented by a bar chart?
You can draw the bar charts of these variables for the 121 patients registered in the data base of Dr. Frank N. Stein.

Use the applet to answer the following questions

Caution: Discrete variables with a large number of possible values are better treated as if they were continuous. For instance, if we count the number of trees per square kilometre, then any value between \(0\) (tree-less land) and close to \(100,000\) (dense forest) may occur. Although we are dealing with discrete count data in this case, it does not make sense to illustrate them in a bar chart, as the number of bars would be huge and the data would be badly summarised.

Synopsis 2.3.1

Bar charts serve to illustrate discrete data with few values.

2.3.2 Pie charts

Now Dr. Frank N. Stein would like to graphically illustrate the variable "Assessment of the resting ECG" with the abbreviation RBURTEI. The variable RBURTEI distinguishes the following categories: "undecided", "normal", "ischaemic origin certain", "ischaemic origin unclear" and "other anomalies".

He could of course also choose a bar chart to visualise this variable. However, as none of the categories has a very small relative frequency, the variable RBURTEI can also be illustrated in a "pie chart".

Figure 2.4. Pie chart of the variable RBURTEI of the \(121\) patients

Each sector of the circle represents the relative frequency of the respective category. The total sample size and the absolute frequencies should also be indicated in this case. A pie chart is used if we want to visually compare relative frequencies. However, if some of the frequencies are similar, a bar chart may be more appropriate.
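Since each sector represents a relative frequency, its angle is the relative frequency multiplied by \(360\) degrees. The following Python sketch shows this calculation. The function name `sector_angles` is our own choice, and because the counts of RBURTEI are not listed in this chapter, the gender counts from Table 2.1 are used as a stand-in.

```python
def sector_angles(counts):
    """Map each category to its sector angle in degrees (relative frequency * 360)."""
    total = sum(counts.values())
    return {category: 360 * count / total for category, count in counts.items()}

# Counts from Table 2.1, used only because the RBURTEI counts are not listed.
angles = sector_angles({"male": 103, "female": 18})
for category, angle in angles.items():
    print(f"{category}: {angle:.1f} degrees")
```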

Synopsis 2.3.2

If a nominal variable has few categories with frequencies which are not too small, then a pie chart may be used to compare the relative frequencies of the categories.

2.3.3. Histogram

We would now like to know the frequencies of the different values of age in Dr. Frank N. Stein's sample of chemical company employees. Since the variable has many values, a bar chart is not appropriate to visualise the frequency distribution of these values. However, the data may be divided into categories. For this, the continuous measuring scale is divided into intervals of equal width, and the number of observations falling into each of the intervals is counted. It is important that the number of intervals is neither too large nor too small.

There are various recommendations regarding the number of categories. A simple recommendation says that \(6\) to \(20\) intervals should be created. Hence \(5\)-year intervals are suitable for the classification of the chemical company employees' age. The respective frequencies are shown in the following table.

Table 2.2: Frequencies of different age categories among the \(121\) patients (age in completed years)
Age interval (years)	Absolute frequency	Relative frequency
25 - 29	1	0.008
30 - 34	7	0.058
35 - 39	11	0.091
40 - 44	18	0.149
45 - 49	25	0.207
50 - 54	17	0.140
55 - 59	18	0.149
60 - 64	9	0.074
65 - 69	11	0.091
70 - 74	3	0.025
75 - 79	1	0.008
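The classification into \(5\)-year intervals can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `bin_ages` is our own, and the ages in `sample_ages` are hypothetical values, not the actual data behind Table 2.2.

```python
from collections import Counter

def bin_ages(ages, width=5):
    """Count ages (in completed years) per interval lower .. lower + width - 1."""
    # Integer division maps each age to the lower limit of its interval.
    counts = Counter((age // width) * width for age in ages)
    # Only intervals that contain at least one observation are returned.
    return {
        f"{lower} - {lower + width - 1}": counts[lower]
        for lower in sorted(counts)
    }

# Hypothetical ages, not taken from Dr. Frank N. Stein's data set.
sample_ages = [29, 31, 31, 32, 40, 44, 45, 47, 49, 52, 58, 63, 76]

binned = bin_ages(sample_ages)
for interval, count in binned.items():
    print(f"{interval}: {count} ({count / len(sample_ages):.3f})")
```

Note that this sketch omits empty intervals. For a histogram they would have to be added with a frequency of \(0\).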

The frequencies of the age categories are now visualised in a so-called "histogram".

Figure 2.5. Histogram of the age of the 121 patients

In the histogram of the variable AGE, we can see that the age of the chemical company employees in Dr. Frank N. Stein's data set lies in the range from \(25\) to \(79\) years. Note that the chemical industry group also requires, and pays for, the examination of retired employees. The highest frequency appears in the interval of \(45\) to \(49\) years, and one can roughly say that a majority of employees are between \(40\) and \(60\) years old.

The histogram resembles a bar chart. However, as there are no natural spaces between the values of a continuous variable, there are no gaps between the bars in a histogram. The base line of each bar coincides with the interval that it represents, and its height is defined by the frequency of the respective category. The absolute or the relative frequencies can be displayed in a histogram - only the scaling of the \(y\)-axis will be different. Notice that observations falling on the boundary between two intervals are assigned to the upper interval.

Caution: Histograms are mostly drawn for intervals of equal width. This is not compulsory though. However, if intervals of different length are chosen, then the areas and not the heights of the bars should be proportional to the frequencies. In this case, the \(y\)-axis represents the frequency density of values, i.e., the frequency of values in the respective interval divided by the length of the interval.
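The frequency density can be computed as in the following Python sketch. The counts are taken from Table 2.2, but the merging of the outer age groups into wider intervals is our own choice, made purely for illustration. The function name `frequency_densities` is likewise an assumption.

```python
def frequency_densities(intervals):
    """intervals: list of (lower, upper, count) with the upper limit excluded.

    Returns a list of (lower, upper, density) with density = count / width.
    """
    return [(lower, upper, count / (upper - lower))
            for lower, upper, count in intervals]

# Age in completed years, so the group "25 - 39" covers the interval [25, 40).
intervals = [
    (25, 40, 1 + 7 + 11),      # three groups merged: width 15 years
    (40, 45, 18),
    (45, 50, 25),
    (50, 55, 17),
    (55, 60, 18),
    (60, 80, 9 + 11 + 3 + 1),  # four groups merged: width 20 years
]

densities = frequency_densities(intervals)
for lower, upper, density in densities:
    print(f"[{lower}, {upper}): {density:.2f} persons per year of age")
```

Multiplying each density by the width of its interval gives back the frequency, which is why the areas of the bars remain proportional to the frequencies.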

Synopsis 2.3.3

The histogram serves to visualise the frequency distribution of continuous variables or of quantitative discrete variables with a large number of values.

Figure 2.6. Histogram of the height of the \(103\) male employees

With the help of the following applet you can test the statements of the previous question and examine various class widths and numbers of classes.

With this applet you can see how the shape of the histogram depends on the class width. The class width can be changed by moving the green ruler. For this, you must position the cursor at the end of the ruler and press the left mouse key to drag the ruler with the cursor.

2.3.4 The empirical distribution function

Dr. Frank N. Stein is still occupied with the illustration of the variable AGE. In addition to the histogram, he now also illustrates the variable with its "empirical distribution function".

Figure 2.8. Empirical distribution function of the age of the 121 patients

The empirical distribution function provides an answer to the question "Which proportion of observations is smaller or equal to a given value?". One such question might be: "What proportion of chemical company employees in Dr. Frank N. Stein's data set are less than or exactly \(60\) years old?".

In order to illustrate how the graph was constructed, some values are listed in the following frequency table.

Table 2.3: Absolute, relative and cumulative relative frequencies of age
Age (years)	Absolute frequency	Relative frequency	Cumulative relative frequency
29	1	1/121 ≈ 0.008	0.008
31	2	2/121 ≈ 0.017	0.025
32	1	1/121 ≈ 0.008	0.033
...	...	...	...
74	1	1/121 ≈ 0.008	0.992
76	1	1/121 ≈ 0.008	1.000

In this table, the observed values of age are represented in the column "age" in increasing order. The values in the column "absolute frequency" indicate how often the respective value of age was observed. The relative frequencies are calculated by dividing the absolute frequencies by the sample size \(121\).

The values in the column "cumulative relative frequency" are calculated as the sum of the relative frequency of the respective age and the relative frequencies of all age values below. In the case of "\(32\) years", the cumulative relative frequency is thus \(0.008 + 0.017 + 0.008 = 0.033\). This means that \(3.3\%\) of the persons in the sample were \(32\) years old or younger.

The empirical distribution function of a quantitative variable \(X\) (i.e., age in our case) is obtained by plotting the cumulative relative frequencies against the respective observed values of \(X\), and by connecting the resulting points with steps rising at the respective values \(X\). The height of each step is equal to the relative frequency of the respective value of \(X\).

Such a function is also called "step function". The empirical distribution function starts at the level \(0\) below the smallest observed value of \(X\) and ends at the level \(1\) at the largest observed value of \(X\).
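The construction of the step function can be sketched in Python. The example uses only the first three rows of Table 2.3 together with the sample size \(121\). The function names `cumulative_relative_frequencies` and `ecdf` are our own choices for this illustration.

```python
from itertools import accumulate

def cumulative_relative_frequencies(value_counts, n):
    """Return [(value, cumulative relative frequency)] in increasing order of value."""
    values = sorted(value_counts)
    running_totals = accumulate(value_counts[v] for v in values)
    return [(v, total / n) for v, total in zip(values, running_totals)]

def ecdf(steps, x):
    """Height of the step function at x (0 below the smallest observed value)."""
    height = 0.0
    for value, cumulative in steps:
        if value <= x:
            height = cumulative  # the function jumps at each observed value
    return height

# First rows of Table 2.3: observed age -> absolute frequency.
first_rows = {29: 1, 31: 2, 32: 1}
steps = cumulative_relative_frequencies(first_rows, n=121)

for value, cumulative in steps:
    print(f"{value}: {cumulative:.3f}")
print(f"{ecdf(steps, 30):.3f}")  # between two steps the function stays constant
```

With the complete list of observed ages, the last cumulative relative frequency would be \(1\).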

If we take a look at the empirical distribution of age, we notice that the step function shows the steepest incline in the range between \(40\) and \(60\) years. This indicates that there is a concentration of employees aged between \(40\) and \(60\) years. At the right and left end, the slope declines rather rapidly, as a consequence of the few observations at the extremes.

The question asked in the beginning, "What proportion of the employees of this chemical industry group are no older than \(60\) years?", can now be answered by drawing a vertical line into the graph at the age of \(60\) years. At the point where the line intersects the step function, we can read the value of the cumulative relative frequency.

Figure 2.9. Empirical distribution function of the age of the \(121\) patients with the answer to the question "What percentage of the \(121\) patients are no older than \(60\) years?"

From this figure, we get a cumulative relative frequency of approximately \(0.8\), meaning that approx. \(80\%\) of all chemical company employees are not older than \(60\) years.

Another question might be "Until which age are you still among the younger employees?". To answer it, we first define the category of "younger employees" as the younger half of all employees, implying that \(50\%\) belong to the younger age group and the remaining \(50\%\) to the older group. Then we draw a horizontal line through the point on the \(y\)-axis with a cumulative relative frequency of \(0.5\) and determine the point where this line intersects the step function. The \(x\)-coordinate of this intersection point is referred to as the "median" (cf. next section).

Figure 2.10. Empirical distribution function of the age of the \(121\) patients with the answer to the question "What is the median age of the \(121\) employees?"

The \(x\)-value of this intersection point is approximately at an age of \(50\) years. From this we can conclude that approx. \(50\%\) of the chemical company employees are no older than \(50\) years. This is where the area of the histogram in figure 2.5 is cut in half. The answer to the above question is that an employee belongs to the younger half if he/she has not yet passed the age of \(50\) years.

Synopsis 2.3.4

The empirical distribution function provides a complete description of the frequency distribution of the values of a quantitative variable observed in a sample.

Open the following applet and try to establish a situation where the distribution function follows roughly a straight line between the minimum and the maximum. For this, you will have to add and/or move points on the \(x\)-axis. Of course, you will never obtain an exact straight line with a finite number of points.

Figure 2.11. Applet for the empirical distribution function (based on the example of cholesterol)

With this applet you can see how the distribution function changes when you move red points with the cursor. By clicking on the x-axis, you can set new points, and you can remove points by moving them beyond the end of the line.

Now restore the original state of the applet and answer the following questions.

2.3.5. Boxplots

Instead of a histogram, the continuous variable AGE can also be visualised using a "boxplot".

Figure 2.12. Boxplot of the age of the \(121\) patients

In a boxplot, the data are reduced to five statistics (or statistical measures). These statistics are

a) minimum

b) lower quartile

c) median

d) upper quartile

e) maximum.

These five statistics can be derived from the empirical distribution function.

First of all we determine the lower quartile of age, i.e., the value of age at which the cumulative relative frequency equals \(0.25\) in the empirical distribution function. Approximately \(25\%\) of the \(121\) chemical company employees have an age below this value (of approx. \(43\) years) and about \(75\%\) have an age above this value.

The median corresponds to the value of age which divides the employees into two groups of equal size. This is the value with a cumulative relative frequency of \(0.5\). We have already determined this value before.

The upper quartile is the value of age which has a cumulative relative frequency of \(0.75\). How these three statistics are determined is shown in the following figure. The minimum and the maximum are represented by the first and the last step, respectively.
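Reading the five statistics off the empirical distribution function can be sketched in Python as follows. We assume here that a quantile is the smallest observed value at which the cumulative relative frequency reaches the required level. Statistics programs use various conventions (often with interpolation), so their results may differ slightly. The function names and the ages are hypothetical.

```python
import math

def ecdf_quantile(values, p):
    """Smallest observed value x whose cumulative relative frequency is >= p."""
    ordered = sorted(values)
    n = len(ordered)
    k = max(1, math.ceil(p * n))  # number of observations that must be <= x
    return ordered[k - 1]

def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    return (
        min(values),
        ecdf_quantile(values, 0.25),
        ecdf_quantile(values, 0.5),
        ecdf_quantile(values, 0.75),
        max(values),
    )

# Hypothetical ages, not taken from Dr. Frank N. Stein's data set.
ages = [29, 31, 31, 32, 40, 44, 45, 47, 49, 52, 58, 63, 76]
summary = five_number_summary(ages)
print(summary)
```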

Figure 2.13. Derivation of the boxplot from the empirical distribution function

The "box" of the boxplot ranges from the lower to the upper quartile. The line in the box represents the median. The central \(50\%\) of the data lie inside the box. The lower and the upper endpoint outside the box represent the minimum and the maximum of the data. A trained person can judge the location, dispersion and symmetry of the data with the help of a boxplot. You will learn more about these terms in chapter 3.

Synopsis 2.3.5

The boxplot serves to visualise the frequency distribution of quantitative variables with many values.

Figure 2.14. Boxplot of the height of the \(103\) male patients

Large amounts of data can be summarised in a compact and concise way with the help of boxplots. Boxplots are very convenient to illustrate several distributions from different samples or groups next to each other. We use this fact to compare the body height of men and women.

Figure 2.15. Boxplot of the height of the \(121\) patients according to their gender

From the two boxplots it can be readily seen that a majority of women in Dr. Stein's sample are shorter than a majority of men.

Modified boxplot

In a modified boxplot, the lines outside the box, the so-called "whiskers", are only drawn up to the minimum and the maximum, if these two values lie within the so-called "fences".

The fences are defined as the two points whose distance from the box is \(1.5\) box lengths \(d\), where \(d\) is the distance between the lower and the upper quartile. The value of the lower fence equals \[\text{lower quartile} - 1.5 \times d ,\] and the value of the upper fence equals \[\text{upper quartile} + 1.5 \times d .\]

If the minimum (maximum) lies outside the fences, then the respective whisker is only drawn to the lowest (highest) value within the fences, and all values lying outside the fences are represented as individual points.
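The rules for the fences and whiskers can be sketched in Python. The heights below are hypothetical, and the quartiles are passed in as arguments because they depend on the quantile convention used. The function name `modified_boxplot_parts` is our own choice.

```python
def modified_boxplot_parts(values, lower_quartile, upper_quartile):
    """Compute fences, whisker ends and outside points of a modified boxplot."""
    d = upper_quartile - lower_quartile      # box length
    lower_fence = lower_quartile - 1.5 * d
    upper_fence = upper_quartile + 1.5 * d
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outside = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return {
        "fences": (lower_fence, upper_fence),
        # Whiskers end at the lowest and highest value within the fences.
        "whiskers": (min(inside), max(inside)),
        # Values beyond the fences are drawn as individual points.
        "outside": outside,
    }

# Hypothetical heights in cm, not taken from Dr. Frank N. Stein's data set.
heights = [158, 165, 168, 170, 172, 174, 175, 177, 180, 183, 201]
parts = modified_boxplot_parts(heights, lower_quartile=168, upper_quartile=180)
print(parts)
```

In this hypothetical sample the box length is \(12\) cm, so the fences lie at \(150\) cm and \(198\) cm, and the value of \(201\) cm is drawn as an individual point.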

As an example, the following modified boxplot illustrates the height of the \(121\) employees (even though the two genders should be treated separately). In this plot, the lower and upper fence, as well as the box length \(d\) and the distance \(1.5 \times d\) between the fences and the box are added.

Figure 2.16. Modified boxplot of the height of all \(121\) patients with the fences drawn in

If we draw separate boxplots for men and women, only the boxplot of men shows a point outside the fences.

Figure 2.17. Modified boxplot of height according to gender

One remark to conclude: By default, most statistics programs do not display horizontal, but vertical modified boxplots.

2.3.6. Scatter plot

We often hear that overweight people are more likely to have high levels of cholesterol than normal weight people. Even though a single high measurement of cholesterol in the blood has a limited significance, Dr. Frank N. Stein wants to examine this claim in the data of the chemical company employees. In order to identify a possible correlation between the weight and cholesterol level, he visualises the variables WEIGHT and CHOLEST of the \(103\) men together in a so-called "scatter plot".

Figure 2.18. Scatter plot of the weight and cholesterol level of the \(103\) male employees

In this scatter plot, the weight of the chemical company employees is shown on the \(x\)-axis and their cholesterol values on the \(y\)-axis. This produces one point for each employee in the diagram.

If we take a look at the resulting scatter plot, we can recognise a slight trend that heavier individuals tend to have higher cholesterol levels. Whether this represents a true correlation between the two variables or just a pattern that might equally well have occurred by chance, would have to be tested formally (cf. chapter 14).

Synopsis 2.3.6

A scatter plot is used to visualise the jointly measured values of two continuous variables. In a scatter plot, the correlation between two variables can be observed.

2.4. Summary

Each type of variable requires a different kind of description.

Nominal data can be summarised in frequency tables, bar charts or pie charts, ordinal data in frequency tables and bar charts. The illustration in a pie chart, however, is only meaningful if the variable does not have too many values and if there are no values with a small relative frequency.

Discrete quantitative data with few values are treated like ordinal data.

The most common graphical representations of continuous quantitative data or of discrete quantitative data with many values are the histogram and the boxplot. However, such data can also be visualised by their empirical distribution function, even though this method rather serves to answer specific questions (see chapter 3) than to provide an overview of the data.

If individual values are divided into intervals, frequency tables can also be used to describe quantitative variables.

The scatter plot is used to show the correlation between two quantitative variables (while all other discussed graphs only describe single variables).

Figure 2.19. Scheme of chapter 2 with gaps

Question Nr. 2.0.1 What do we need graphic illustrations for?
	To get a picture of the distribution of the data.
	To judge the location of the data.
	To get an overview of the data.
	To detect patterns in the data.
	To visualise the spread of quantitative variables.
	To enable a visual comparison of data sets.
	To summarise the data.
	To judge whether a quantitative variable has a symmetrical distribution or not.

Question Nr. 2.3.1 Which rules must be followed when drawing a bar chart? Tick the correct statements.
	The illustrated variables should be continuous.
	All bars should have the same width.
	All bars should have the same height.
	There should be a gap between the bars.
	The bars should be the same colour.

Question Nr. 2.3.2 Why should there be a gap between the bars?
	Because it looks prettier.
	Because we are dealing with continuous data.
	Because we are dealing with discrete data.
	Because there is no relation between the categories.
	Because the categories have a certain distance from each other.

Question Nr. 2.3.3 How many chemical company employees state that their health condition is bad?
	less than five
	more than half
	exactly six

Question Nr. 2.3.4 How do you judge the statement "Many of the chemical company employees have already had a myocardial infarction"?
	This statement is correct.
	This is not clear from the data.
	This statement is false.

Question Nr. 2.3.5 How many chemical company employees consume alcohol on a daily basis?
	more than half
	about 30 persons
	no one

Question Nr. 2.3.6 Why must there not be gaps between neighboring bars of a histogram?
	Because the numbers covered by the lower (resp. upper) bar may reach up (resp. down) until the limit between the two bars.
	Because the histogram would not look sufficiently compact otherwise.
	Because we are dealing with continuous data.
	Because values of continuous variables can lie arbitrarily close to each other.
	Because the bars would look too isolated otherwise.

Question Nr. 2.3.7 What can be said about the distribution of the height of the male chemical company employees based on the histogram? Look at the histogram in Fig. 2.6, which represents the height of male chemical company employees.
	The employees are between 155 cm and 195 cm tall.
	The highest frequency appears in the interval from 170 cm to 175 cm.
	The mean value of height is 174.3 cm.
	There are more short than tall employees.
	A large proportion of the employees is between 165 cm and 185 cm tall.
	The distribution of the data is approximately symmetrical, i.e., the left and the right hand part of the histogram are about symmetrical to each other.
	There are only few employees who measure more than 185 cm.

Question Nr. 2.3.8 What happens if we choose a very small width for the classes of a histogram?
	Only little information of the original data gets lost.
	There are only very few classes which summarise the data well.
	The histogram becomes very regular.
	The histogram is very busy and does not summarise the data well.
	There are lots of classes and the class frequencies fluctuate.

Question Nr. 2.3.9 Which of the following situations results in an empirical distribution function which approximates a continuous, diagonal line, beginning in the lower left corner and ending in the upper right corner?
	Such a distribution function is not possible.
	A situation with only a few measured values, which mostly lie close to the maximum and close to the minimum.
	A situation with a large number of measured values, which are evenly distributed between the minimum and the maximum.
	A situation with few measured values, which are evenly distributed between the minimum and the maximum.
	A situation with a large number of measured values, the majority of which are very small.
	A situation with only one measured value.

Question Nr. 2.3.10 Which value of cholesterol divides the data in two halves?
	approximately 200 mg/dl
	approximately 210 mg/dl
	approximately 220 mg/dl

Question Nr. 2.3.11 Which value of cholesterol divides the data in a lower fourth and three upper fourths?
	approximately 200 mg/dl
	approximately 190 mg/dl
	approximately 210 mg/dl

Question Nr. 2.3.12 Which values represent the minimum and the maximum of the cholesterol data?
	132 and 331 mg/dl
	157 and 317 mg/dl
	120 and 340 mg/dl