Chapter 1

Fundamental terms

In this chapter you are introduced to the fundamental terms of statistics, as for example "population", "sample" and "variable". These fundamental terms are the basic elements of the language of statistics, they are indispensable for the comprehension and wording of statistical problems and for the application of statistical methods.

Educational objectives

After having worked through this chapter you will know the fundamental terms of statistics. You will be able to describe the population from which a sample was drawn and the observational units and variables involved in a sample. You can classify variables based on their values.

Key words: population, sample, observational unit, variable, value, type of variable

Figure 1.1. Scheme of chapter 1

1.1. The health of chemical company employees

A chemical company in Basel asks its employees to undergo a medical examination every five years. It has closed contracts with \(26\) medical practitioners, who agreed to examine the patients at a fixed rate paid by the company. The employees are randomly allocated to the medical practitioners.

The practitioners have agreed to collect a set of basic data on the employees, including

  • gender
  • weight
  • height
  • age
  • blood pressure
  • blood glucose level
  • cholesterol level
  • haemoglobin
  • leukocyte count
  • past myocardial infarction
  • level of appetite
  • etc.
and to make these data available to the company in an anonymized form, so that the company can keep an overview of the state of health of its employees.

One of these medical practitioners, Dr. Frank N. Stein, would like to get an idea of the employees' state of health himself. He decides to describe and summarize the data of his allocated patients with the help of statistics. First of all, this requires a list of the variables and of the values these variables take in each of the observational units. But what do the terms "observational unit", "variable" and "value" mean?

1.2. Observational units, variables and their values

1.2.1. Observational units

In the following, the term "patient file" will be used for the data of the chemical company employees, who were allocated to Dr. Frank N. Stein. In this case, the patients are the observational units. In general, subjects or objects on which data are collected are called "observational units". Depending on the research question, observational units can be persons, objects or also abstract entities (e.g., time intervals).

Question Nr. 1.2.1 The chemical company of Dr. Frank N. Stein's patients produces vaccine ampoules. By means of statistical methods it must be ensured that the concentration of the active ingredient of the delivered ampoules remains constant. What do you suggest as the observational unit?
the whole production facility
the individual ampoule
the composition of the vaccine
an individual employee of the company, who is directly involved in the production of the vaccine
an individual patient treated with the vaccine

Question Nr. 1.2.2 In a study by the same company, it shall be examined, if a newly developed drug against stomach trouble is better than the existing drug. What do you suggest as the observational units?
all patients with stomach trouble
the physicians attending the patients participating the study
the stomachs of the physicians attending the patients in the study
the individual patients with stomach trouble participating in the study
the stomachs of the patients participating in the study

1.2.2. Variables

The data of the patients, which are available in the patient file, include height, weight, age, gender, blood pressure, blood sugar, cholesterol and haemoglobin. These characteristics are called "variables" in statistics.

Question Nr. 1.2.3 Which variables are of interest if the growth of infants is to be examined. For which of the following variables would it make sense to be collected during the course of such a study?
type of bed in which the infant sleeps
length of the infant
weight of the mother
heart rate of the infant
blood pressure of the infant
gripping reflex of the infant
the amount of hair the infant has
head circumference of the infant
the number of baby-soothers the infant needs
weight of the infant

1.2.3. Values

As its name tells, a variable must have more than one possible value. For instance, "female" and "male" are the values of the variable "gender". Moreover, alcohol consumption of the chemical emplayees attended by Dr. Frank N. Stein is recorded using the three values (or categories) "daily", "weekly" and "seldom/never".

Caution: In common language, the statement "she/he has blue eyes" is often interpreted as a person's characteristic. However, in statistics, "blue eyes" is the value of the variable or characteristic "eye colour" in these persons. The terms "variable" (or "characteristic") and "value" (or "category") must not be confounded with each other in statistics!

1.3. Population and sample

1.3.1. Population

In general, the totality of all potential observational units, on which one would like to collect information, is called "population". The definition of a population must be delimited in time and space and by factual conditions, and it depends on the question of interest.

For instance, the population of chemical employees from which the patients of Dr. Frank N. Stein were selected is constrained by the fact that

  • a) only current employees (factual condition)
  • and
  • b) only employees living in Basel (spatial condition)
are considered.

If we assess the variables of interest in all units of a population, we speak of a "census".

1.3.2. Sample

In most cases, it is not possible or not justifiable to carry out a census. Instead, a subset of observational units is selected from the respective population. Such subsets are called "samples". They should ideally consist of observational units which were randomly drawn from the population.

The terms "randomness" and "randomly" will repeatedly appear in the following sections and chapters, because they are a central element of statistics. They will be defined in more detail in chapter 4. Since randomness is omnipresent in our lives, we all have our own concept of it, which reflects our individual experiences. We regard things as being random if we cannot predict them.

When being confronted with random phenomena, we try to anticipate their various possible outcomes. At best, however, we know how frequently, resp. with which probability each of these outcomes occurs. This is why we can never predict if a coin falls on heads or tails, but we can assume that both outcomes are equally likely, i.e., that they appear equally often, on average.

Neither can we predict which individuals will be included in a random sample. However, we can perform the selection procedure in such a way that each individual of the population has the same chance to be included in the sample.

Unfortunately, the ideal case of a random sample is rarely possible in Medicine. In most cases, it is not possible to randomly select subjects from a registry covering an entire patient population. Medical studies are often based on so-called "convenience samples", which may consist of patients being examined during a certain time period.

Table 1.1: Summary of the fundamental terms
Term Definion
population the totality of all potential observational units of interest
sample (random) selection of observational units from the population
observational unit single subject, object or entity, on which data are collected
variable variable or characteristic of interest, which is measured or observed in observational units
values the possible values of a variable or categories of a characteristic

1.4. Categories (types) of variables

If we take a look at the variables gender, alcohol consumption and body height of Dr. Frank N. Stein's data set, we immediately notice differences. Gender and alcohol consumption both have a limited number of values, but the values of alcohol consumption, i.e., "daily", "weekly" and "seldom/never" have an inherent natural order.

On the other hand, body height can take on any value between, let's say, \(100\) and \(220\) cm (in rare cases values outside this range can be observed). In principle, body height can thus take on an infinite number of values. Variables can thus be divided into different categories or data types.

1.4.1. Qualitative variables

Variables such as gender and alcohol consumption in Dr. Frank N. Stein's data set represent qualitative data. The values of gender have no quantitative aspect at all, and those of alcohol consumption do not represent specific quantities (even though they refer to the frequency of alcohol consumption). Qualitative data can additionally be divided into nominal and ordinal data. Instead of qualitative variables, we can also speak of "categorical variables".

1.4.1.1. Nominal variables

If the categories defined by the values of a qualitative variable have no inherent natural order (e.g., from lower to higher or from worse to better), the variable is said to be "nominal". Gender, nationality or religion are examples of nominal variables.

Numbers are often allocated to the values of nominal variables for coding purposes (e.g., to save space). However, no meaningful calculations can be made with these numbers. For instance, if we allocate \(1\) to "male" and \(2\) to "female", how should \(2 - 1 = 1\) be interpreted in a meaningful way?

1.4.1.2 Ordinal variables

Ordinal variables take values which can be brought into a meaningful natural order, as is the case with the variable alcohol consumption with the values "daily", "weekly" and "seldom/never". In Medicine, variables like the stage of a disease or the perception of pain (with different pre-defined levels of intensity) are ordinal.

Numbers are also allocated to the values of ordinal variables for coding purposes, and it is obvious that the numbers must then reflect the order between the categories (e.g., \(1\) for "seldom/never", \(2\) for "weekly" and \(3\) for "daily" in the variable "alcohol consumption"). In general, differences between these numbers have no meaningful interpretation either.

1.4.2. Quantitative variables

For a variable to be quantitative, its values must be of a numerical nature. Examples of quantitative variables are body height, age or the number of children in a family. Quantitative variables are divided into discrete and continuous variables.

1.4.2.1 Quantitative discrete variables

Discrete quantitative variables only take certain numerical values. For example the variable "number of children in a household" can only take the values \(0, 1, 2, ...,\) etc. (notice that the observational units are households in this case). Other examples are the number of physician's visits in the last \(12\) months or the number of healthy teeth if the observational units are persons. The values of quantitative discrete variables are typically assessed by counting.

In general variables which only possess a finite number of values are called discrete. This is why basically all qualitative variables are discrete. The so-called "Visual Analogue Scales" are an exception and will be shortly introduced at the end of this chapter.

1.4.2.2 Continuous data

Variables are called "continuous" if their values can, in principle, be measured at any degree of accuracy. A typical example of a continuous variable is body height, which could be measured with increasing accuracy in \(cm\), \(mm\), \(\mu m\), ... . The values of continuous variables are always obtained by measurements.

When analysing and interpreting data, discrete variables with a large number of values are often treated as if they were continuous (e.g. the number of leukocytes).

With continuous data, we additionally distinguish between interval- and rational-scaled variables. Body temperature is a typical example of an interval-scaled variable, as it has no meaningful zero-point, and ratios between body temperatures (e.g., \(38^{\circ} / 37^{\circ} = 1.027\)) cannot be interpreted meaningfully. A typical rational-scaled variable is body height. In this case we have an absolute zero-point and ratios can be interpreted. For instance, if a child has grown from \(1.6 m\) to \(1.7 m\) in one year, the ratio equals \(1.7/1.6 = 1.0625\) and one can say that the child's height increased by \(6.25\%\).

Table 1.2: Summary of the types of variables
Type Definition
nominal (qualitative) Values of nominal variables define categories without a natural order
ordinal (qualitative) Yalues of ordinal variables define categories with a natural order
discrete quantitative values of discrete quantitative variables have a quantitative meaning and their number is finite.
continuous (quantitative) Values of continuous variables can, in principle, take any value in a given interval on a numerical scale

You may now first open the applet "Grouping variables 1" and then the applet "Grouping variables 2" to test whether you are able to assign variables correctly to the four different types.

Applet "Grouping variables 1"

Figure 1.2. Applet "Grouping variables 1"

Applet "Grouping variables 2"

Figure 1.3. Applet "Grouping variables 2"

1.4.3 Visual Analogue Scales

In the past decades, so-called "Visual Analogue Scales" (VAS) have become very important. A patient can e.g. record his/her present level of pain as a specific point on a visual analogue scale whose endpoints represent the extreme values (e.g. \(0\) = complete absence of pain, \(100\) = maximum level of pain). Such scales belong to the category of ordinal variables, but are often treated like quantitative data in statistical analyses.

Every hospital doctor is familiar with the dolometer, on which a patient can indicate his/her pain level, using a flexible line, so that the practitioner can read the respective numerical value on the back side, which is invisible to the patient.

Figure 1.4. Dolometer

1.5. Summary

The population consists of all potential observational units, on which one would like to make a statement.

A sample is a selection of observational units from the population. Ideally, the observational units of a sample are drawn randomly. If each individual of the population has the same chance of being drawn into the sample, we speak of a simple random sample.

A variable is a characteristic, which is measured or observed on observational units. Variables whose values have no quantitative meaning are called qualitative variables, and variables whose values represent quantities are called quantitative variables.

Qualitative variables are sub-divided into nominal and ordinal variables: the values of ordinal variables have an inherent natural order, while those of nominal variables do not.

Quantitative variables are sub-divided into discrete and continuous variables: while discrete quantitative variable can only take specific numerical varlues (e.g., counts), continuous variables can, in principle, take any value of a numerical interval.

Figure 1.5. Scheme of chapter 1 with gaps