Chapter 9

Statistical hypothesis testing: terms and concepts

The progress of the empirical sciences is driven by the formulation and testing of new hypotheses, which help to achieve a clearer and deeper understanding of phenomena. Often, new strategies for practice can be derived from them. A hypothesis is considered valid if it has been supported by empirical data, and it remains so until it is disproved by new empirical data. In fact, a hypothesis can never be fully proved.
Statistics provides methods to quantify how well empirical data, which were gained under controlled conditions, are in agreement with a certain hypothesis, thereby taking into account the influence of chance.

Educational objectives

After having worked through this chapter you will know the structure of a statistical hypothesis test, the various test decisions, the errors that can be made, the probabilities of these errors, as well as the relation between these terms. You will furthermore know how to formulate the null hypothesis and the alternative hypothesis for a test situation.

Key words: null hypothesis, alternative hypothesis, test statistic, rejection region of \( H_0 \), type I error, \( \alpha \)-error probability, significance level, type II error, \( \beta \)-error probability, statistical power, \( p \)-value

Previous knowledge: nominal data, quantitative data (chap. 1), sample, mean value (chap. 3), normal distribution (chap. 6), sampling distribution (chap. 8)

Central questions: What is a statistical hypothesis test? What does the result of a statistical hypothesis test mean? Which errors can be made in such a test? How can these errors be controlled?

Figure 9.1. Scheme of chapter 9

9.1. The basic elements of statistical hypothesis tests

Do women eat too little before their menstruation?

It is suspected that the daily energy consumption of women before their menstruation lies below the recommended value of \( 7425 \text{ kJ} \). In a study by Manocha et al. (1986) (cf. [1], page 188), the average daily energy consumption over a period of 10 days prior to menstruation was recorded for 11 randomly chosen healthy women aged between 20 and 30 years. The following values in kJ/day were measured in this study: \[ 5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770 \text{ kJ} .\]

The 11 women had an average consumption of \( 6753.6 \text{ kJ} \) per day. We already know from chapter 8 that the sample mean is a random quantity. Can we now state, based on this sample, that the mean energy consumption prior to the menstruation phase, \( \mu \), in the population of healthy women between 20 and 30 years of age is indeed lower than the recommended value? In order to answer this and similar questions, the concept of statistical hypothesis testing was introduced.
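The sample mean reported above can be reproduced directly from the raw data; a minimal sketch in Python:

```python
# Daily energy consumption (kJ/day) of the 11 women from Manocha et al. (1986)
data = [5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770]

# Sample mean of the 11 measurements
mean = sum(data) / len(data)
print(round(mean, 1))  # 6753.6
```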

9.1.1 Statistical hypotheses

If we have the impression that the mean daily energy consumption \( \mu \) in the population of women in the pre-menstruation phase lies under the recommended value of \( 7425 \text{ kJ} \), and if we want to see whether the data support this assumption, we formulate the assumption in the hypothesis \( H_A: \mu \lt 7425 \text{ kJ} \).

The converse hypothesis will be that the mean energy consumption in this sub-population of women is \( \geq 7425 \text{ kJ} \). This hypothesis is called the null hypothesis (\(H_0\)).

It would be natural to consider our own hypothesis as the primary one and the null hypothesis as the "alternative hypothesis". However, we will see that the null hypothesis plays a fundamental role in deciding whether \( H_A \) is sufficiently supported by the observed data or not. This is why \( H_0 \) is considered the "point of departure" from a theoretical perspective, while the researchers' hypothesis represents the alternative to the null hypothesis.

Definition 9.1.1

In the alternative hypothesis \( H_A \), we postulate

  • a difference between two populations,
  • a specific treatment effect, or
  • a specific relation between two variables.
The null hypothesis \( H_0 \) denies the existence of such a difference, treatment effect or relation.

Synopsis 9.1.1

In a first step, we phrase our hypothesis \( H_A \) and the null hypothesis \( H_0 \). In a second step, we collect appropriate data for testing \( H_A \).

In our example, we have

    \( H_0: \mu \geq 7425 \text{ kJ} \)
    and
    \( H_A: \mu \lt 7425 \text{ kJ} \)

with \( \mu \) denoting the unknown mean daily energy uptake of women before their menstruation.

9.1.2 Statistical hypothesis tests involve a test statistic

In our example, we can use the observed mean value of the energy consumption as a test statistic in order to decide whether to reject the null hypothesis \( H_0 : \mu \geq 7425 \text{ kJ} \) (and hence decide in favor of the alternative hypothesis) or not. The value of our test statistic, namely \( 6753.6 \text{ kJ} \), rather seems to speak in favour of \( H_A : \mu \lt 7425 \text{ kJ} \). But if we had observed a sample mean of \( 6000 \text{ kJ} \) in another sample of 11 women, this would speak even more strongly in favour of \( H_A \) than the value \( 6753.6 \text{ kJ} \).

Definition 9.1.2

A test statistic is a measure that is calculated from a sample according to a given rule and that is suitable for testing an alternative hypothesis.

Synopsis 9.1.2

After having formulated the null- and the alternative hypothesis, we calculate a test statistic which is suitable for testing the given alternative hypothesis.

Question Nr. 9.1.1 For which of the following potentially observed mean values would you rather decide in favour of HA?
8000 kJ
6200 kJ
7425 kJ
7725 kJ

What is obviously missing is an upper limit \( k \) for observed mean values allowing us to reject \( H_0 \). We will first consider such a limit \( k \) as given (but we will soon see how to determine \( k \)). We then decide in favour of \( H_A \), and thus reject the null hypothesis \( H_0 \), if the observed mean value is smaller than \( k \). If the observed mean value is \( \geq k \), we cannot reject \( H_0 \). The non-rejection and the rejection region of \( H_0 \) for this procedure are illustrated in the following figure.

Figure 9.2. Schematic illustration of the non-rejection and the rejection region of \( H_0 \)

Definition 9.1.3

The rejection region of \( H_0 \) includes those values of the test statistic for which we reject \( H_0 \).

In a more general way, we can describe statistical hypothesis testing as a procedure which enables deciding whether or not to reject \( H_0 \), or, equivalently, whether or not to decide in favour of \( H_A \), based on the data of a random sample. Most statistical hypothesis tests are conducted with the help of a test statistic. Such a test statistic can be based on the mean, the median or another measure of the sample. The test consists in making a decision in favour of or against the alternative hypothesis, based on the value of the test statistic. To do so, we divide the range of possible values of the test statistic into a rejection and a non-rejection region for \( H_0 \) and reject \( H_0 \) (i.e., decide in favour of \( H_A \)), if our test statistic lies in the rejection region.

Synopsis 9.1.3

In a third step we either decide

in favour of \( H_A \) and to reject \( H_0 \), if the value of the test statistic lies in the rejection region of \( H_0 \),

or

not to reject \( H_0 \), if the value of the test statistic does not lie in the rejection region of \( H_0 \).
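The three-step decision rule can be sketched as a small Python function. The critical limit of about \( 6860 \text{ kJ} \) used in the usage example is only an assumed placeholder here; how such a limit is determined is the subject of the next section:

```python
def decide(sample_mean, k):
    """Decision rule for testing H_A: mu < 7425 kJ.

    Reject H0 if the test statistic (the sample mean) falls in the
    rejection region (-inf, k); otherwise do not reject H0.
    """
    return "reject H0" if sample_mean < k else "do not reject H0"

K = 6860  # assumed critical limit in kJ (a placeholder for illustration)
print(decide(6753.6, K))  # reject H0
print(decide(7515.0, K))  # do not reject H0
```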

9.1.3 Type I and type II errors

As we have learnt in the previous section, we can decide whether or not the null hypothesis \( H_0 \) is compatible with the data at hand, based on a statistical hypothesis test. Table 9.1 illustrates the possible situations which we can encounter in such a decision.


Table 9.1: The different situations of a statistical test

                                    Reality
Decision                            \( H_0 \) true                       \( H_A \) true
non-rejection of \( H_0 \)          correct decision                     false decision (type II error)
rejection of \( H_0 \)              false decision (type I error)        correct decision

The table shows that we can make two false decisions, which are referred to as "type I error" and "type II error". In Table 9.2, the decision procedure of a statistical test is compared with a medical diagnostic test. If the disease in question is abbreviated by \(D\), the situation "patient does not have \(D\)" takes the role of \( H_0 \), while the situation "patient has \(D\)" takes the role of \( H_A \). In this case we can also make two false decisions, i.e., by wrongly diagnosing a person with \(D\) as being free of \(D\) based on a false negative test (\( \sim \) type II error) or a patient without \(D\) as having \(D\) based on a false positive test (\( \sim \) type I error).


Table 9.2: The different situations of disease diagnosis

                                    State of patient
Test result                         without disease \(D\)                with disease \(D\)
negative                            correct decision                     false decision (false negative)
positive                            false decision (false positive)      correct decision

We will now first take a closer look at type I errors.

Type I error:

The accepted probability for the occurrence of a type I error is denoted by \( \alpha \). It defines the so-called "significance level" of the test and is usually fixed at \( \alpha = 0.05 \). The probability that \( H_0 \) is wrongly rejected in favour of the alternative hypothesis must then not be larger than \( 0.05 \) (or \( 5\% \)).

Definition 9.1.4

The rejection of a correct null hypothesis in favour of an incorrect alternative hypothesis is referred to as "type I error". The probability of such an error is called "\( \alpha \)-error probability".

Definition 9.1.5

The significance level of a statistical test is the accepted probability for a type I error.

Synopsis 9.1.4

If the null hypothesis is correct and if it is tested at the significance level \( \alpha \), then it is wrongly rejected with a probability of at most \( \alpha \).

Exercise/Simulation

We take up the example of the mean daily energy consumption \( \mu \) of women between 20 and 30 years of age before their menstruation. We are now interested in testing the hypothesis \( H_0 : \mu \geq 7425 \) against the hypothesis \( H_A : \mu \lt 7425 \). For this testing problem, we will examine 1000 random samples of 11 women each, in which the individual daily energy uptakes are assumed to be normally distributed with a mean value of \( 7425 \text{ kJ} \) and a standard deviation of \( 1142 \text{ kJ} \). As a test statistic we will use the sample mean value of the 11 measurements. The rejection region of \( H_0 \) is of the form \( (-\infty, k) \). With the applet "Type I error", we can determine the value of \( k \) for which \( \alpha \) equals \( 0.05 \).
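A sketch of this procedure in Python, using only the standard library. Under \( H_0 \), the sample mean is normally distributed with mean \( 7425 \text{ kJ} \) and standard error \( 1142/\sqrt{11} \text{ kJ} \) (chapter 8), so the critical limit \( k \) can also be obtained analytically and then checked by simulation:

```python
import math
import random
from statistics import NormalDist, fmean

MU0, SIGMA, N = 7425.0, 1142.0, 11
se = SIGMA / math.sqrt(N)              # standard error of the sample mean

# Analytic critical limit: reject H0 if the sample mean < k, with alpha = 0.05
k = NormalDist(MU0, se).inv_cdf(0.05)
print(round(k))                        # about 6859 kJ

# Monte Carlo check: the fraction of 1000 samples drawn under H0 whose
# mean falls in the rejection region should be close to alpha = 0.05
rng = random.Random(1)                 # fixed seed for reproducibility
rejections = sum(
    fmean(rng.gauss(MU0, SIGMA) for _ in range(N)) < k
    for _ in range(1000)
)
print(rejections / 1000)               # close to 0.05
```

The analytic value of about \( 6860 \text{ kJ} \) agrees with the answer suggested by the applet.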

Figure 9.3. Applet "Type I error"

Question Nr. 9.1.2.1 What value should the critical limit k approximately have for our test to maintain a significance level α of 0.05?
about 7120 kJ
about 6860 kJ
about 7270 kJ
about 6120 kJ

Question Nr. 9.1.2.2 What value should the critical limit k approximately have for our test to maintain a significance level α of 0.1?
6860 kJ
7000 kJ
7120 kJ
7510 kJ

Question Nr. 9.1.2.3 What value should the critical limit k approximately have for our test to maintain a significance level α of 0.15?
6420 kJ
7000 kJ
7070 kJ
7890 kJ

Type II errors:

Until now, the alternative hypothesis was formulated in a very general way. However, the probability of a type II error can only be computed if we postulate a specific value of \( \mu \) within \( H_A \). Such a specific alternative hypothesis, which we will denote by \( H_1 \), is called a "working hypothesis".

A type II error occurs if \( H_A \) is correct but our test statistic lies in the non-rejection region of \( H_0 \). The probability of a type II error is denoted by \( \beta \). If we worked with a critical limit \( k = 6500 \text{ kJ} \) in our example of the energy consumption, then the observed value \( 6753.6 \text{ kJ} \) would lie in the non-rejection region of \( H_0 \).

Let us now assume that the true mean value of daily energy uptake in our sub-population of women equals \( \mu = 7100 \text{ kJ} \) (working hypothesis). Then \( H_A : \mu \lt 7425 \text{ kJ} \) is correct and we wrongly do not reject the null hypothesis \( H_0 \), i.e. we commit a type II error.
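Under this working hypothesis, \( \beta \) can be computed directly from the sampling distribution of the mean under \( H_1 \). A sketch, using the values \( k = 6500 \text{ kJ} \) and \( \mu_1 = 7100 \text{ kJ} \) from the text above:

```python
import math
from statistics import NormalDist

SIGMA, N = 1142.0, 11
se = SIGMA / math.sqrt(N)   # standard error of the sample mean

k = 6500.0                  # critical limit assumed in the text
mu1 = 7100.0                # working hypothesis H1

# beta = P(sample mean >= k | mu = mu1): the test statistic lands in the
# non-rejection region of H0 although H_A is true
beta = 1 - NormalDist(mu1, se).cdf(k)
print(round(beta, 2))       # 0.96
```

Such a low critical limit makes a type I error very unlikely but a type II error almost certain, illustrating the trade-off between \( \alpha \) and \( \beta \).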

Figure 9.4. Applet "Type I and II errors"

Question Nr. 9.1.3 Which of the following statements would you confirm based on your observations?
the smaller k, the smaller α and the bigger β
the smaller k, the smaller α and the smaller β
the smaller k, the bigger α and the smaller β
the bigger α, the smaller β

Question Nr. 9.1.4.1 Which is the approximate value of β if we assume that μ = 6700 kJ and if we use a value around 6840 kJ for k?
about 0.125
about 0.285
about 0.355
about 0.515

Question Nr. 9.1.4.2 How would we have to choose k, if we wanted to get β ≈ 0.2 for a value of μ of about 6770 kJ?
approx. 7070 kJ
approx. 7320 kJ
approx. 7550 kJ
approx. 7750 kJ

Question Nr. 9.1.4.3 What is the value of α if k is chosen as in question 9.1.4.2?
about 0.05
about 0.15
about 0.5
about 0.95

Question Nr. 9.1.4.4 The mean of the observed sample is 6754 kJ. Would you rather decide for H0 or for HA if a significance level α of 0.05 is used?
decide for H0
decide for HA

Definition 9.1.6

Failing to reject the null hypothesis if the alternative hypothesis is correct is referred to as a type II error. The probability of committing a type II error is called the \( \beta \)-error probability.

Another important quantity in a statistical testing problem is the so-called "statistical power". The statistical power of a test with a given significance level \( \alpha \) is the probability of rejecting the null hypothesis \( H_0 \) if the working hypothesis \( H_1 \) is correct. The statistical power of a hypothesis test is closely linked to the \( \beta \)-error probability, since the following applies: \[ \text{statistical power} = 1 - \beta .\]

Now work through the applet "Type I and type II errors, n variable" in order to examine the relation between the statistical power and the sample size.

Figure 9.5. Applet "Type I and II errors, variable sample size"

Question Nr. 9.1.5.1 Assume that the true value of μ equals the observed value of 6754 kJ and use the applet `Type I and type II errors, n variable' to answer the following question.
What sample size n would we have to choose if we wanted to get a significant result at the level α = 0.05 in another random sample of women prior to their menstruation with a probability 1-β = 0.90?
Note: The null hypothesis is still the same as at the beginning.
Hint: First fix μ at about 6754 kJ and choose a new value for n. Then adjust k in such a way that α becomes approx. 0.05. Now take a look at the resulting β-error. Its value tells you whether the new value of n is too big or too small. Correct the value of n accordingly and repeat the above steps. If you always draw the right conclusion you will get the correct result relatively fast.
The value of n must be about 18 or 19
The value of n must be about 24 or 25
The value of n must be about 30 or 31
The value of n must be about 10 or 11

Question Nr. 9.1.5.2 A statistical power of 1 - β = 0.80 is often considered sufficient. Which value would n have to take in this case?
The value of n must be about 17 or 18
The value of n must be about 21 or 22
The value of n must be about 27 or 28
The value of n must be about 11 or 12
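Instead of trial and error with the applet, the required sample size for this one-sided test on a normal mean (with known \( \sigma \)) can be computed from the standard formula \( n = \big( (z_{1-\alpha} + z_{1-\beta})\, \sigma / \Delta \big)^2 \), where \( \Delta \) is the difference between the means under \( H_0 \) and \( H_1 \). A sketch under the assumptions of the example (\( \Delta = 7425 - 6754 = 671 \text{ kJ} \), \( \sigma = 1142 \text{ kJ} \)):

```python
import math
from statistics import NormalDist

def required_n(delta, sigma, alpha=0.05, power=0.90):
    """Sample size for a one-sided test on a normal mean with known sigma."""
    z = NormalDist()  # standard normal
    n = ((z.inv_cdf(1 - alpha) + z.inv_cdf(power)) * sigma / delta) ** 2
    return math.ceil(n)

delta = 7425 - 6754   # assumed true effect (working hypothesis mu1 = 6754 kJ)
print(required_n(delta, 1142, power=0.90))  # 25
print(required_n(delta, 1142, power=0.80))  # 18
```

The results agree with the answers found with the applet (about 24 or 25 women for a power of 0.90, about 17 or 18 for a power of 0.80).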

Definition 9.1.7

The probability of being able to reject the null hypothesis, under the assumption that \( H_1 \) is correct, is referred to as (statistical) power.

Notice that the power depends on the significance level chosen.

Synopsis 9.1.5

The following relation holds between the power of a statistical hypothesis test and the probability \( \beta \) of a type II error:

power = \( 1 - \beta \) .

We can also illustrate the probabilities of the different test decisions in a two by two table (Table 9.3). As in tables 9.1 and 9.2, we will again compare the respective decision probabilities with those of a diagnostic test. Again, we let \( H_0 \) correspond to the situation of a "patient without the disease \(D\)" and \( H_A \) to the situation of a "patient with the disease \(D\)".


Table 9.3: Analogy between probabilities associated with statistical tests and probabilities associated with diagnostic tests

                                            \( H_0 \) true                         \( H_A \) true
Probability of not rejecting \( H_0 \)      \( 1 - \alpha \)                       \( \beta \)
Probability of rejecting \( H_0 \)          \( \alpha \) (= significance level)    \( 1 - \beta \) (= power)

                                            without disease \(D\)                  with disease \(D\)
Probability of negative test result         specificity                            1 - sensitivity
Probability of positive test result         1 - specificity                        sensitivity

9.1.4 The determinants of type II errors and statistical power

In the following figure, the sampling distributions of the mean values of the calorie consumption of \(n\) women before their menstruation are illustrated under \( H_0 \) (right density curve with the centre \( \mu_0 \)) and under \( H_1 \) (left density curve with the centre \( \mu_1 \)).

Figure 9.6. Determinants of type II errors

    a) Situation with \( \mu_0 = 7425 \text{ kJ}, \, \mu_1 = 6500 \text{ kJ} \) and \( n = 11 \), with a significance level \( \alpha = 5\% \) (vertically shaded area) and a type II error probability \( \beta = 15\% \) (horizontally shaded area).
    b) Same situation as in a), but with a lower significance level \( \alpha = 1\% \). Accordingly, the probability \( \beta \) of a type II error is higher (\( \approx 35\% \)).
    c) Situation with \( \mu_0 = 7425 \text{ kJ}, \, \mu_1 = 6800 \text{ kJ} \) and \( n = 11 \), with a significance level \( \alpha = 5\% \). Since \( H_0 \) and \( H_1 \) are closer to each other in this case, the probability \( \beta \) of a type II error is higher (\( \approx 45\%\)).
    d) Same situation as in c), but with a doubled sample size \( (n = 22) \). The two distributions have become narrower and thus overlap less. As a result, the probability \( \beta \) of a type II error becomes smaller and now only equals \( 18\% \).
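The four \( \beta \)-values quoted in panels a) to d) can be reproduced numerically; a sketch (small deviations from the rounded percentages in the captions are to be expected):

```python
import math
from statistics import NormalDist

def beta_error(mu0, mu1, sigma, n, alpha):
    """Type II error of the one-sided test with rejection region (-inf, k)."""
    se = sigma / math.sqrt(n)                 # standard error of the mean
    k = NormalDist(mu0, se).inv_cdf(alpha)    # critical limit at level alpha
    return 1 - NormalDist(mu1, se).cdf(k)     # P(no rejection | mu = mu1)

print(round(beta_error(7425, 6500, 1142, 11, 0.05), 2))  # a) 0.15
print(round(beta_error(7425, 6500, 1142, 11, 0.01), 2))  # b) 0.36
print(round(beta_error(7425, 6800, 1142, 11, 0.05), 2))  # c) 0.43
print(round(beta_error(7425, 6800, 1142, 22, 0.05), 2))  # d) 0.18
```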

We know from chapter 8 that the spread of the sampling distribution of a mean value is described by the standard error which depends on the following two factors:

    (i) the sample size n (square root of n law)
    (ii) the standard deviation \( \sigma \) of the variable across observational units.

Synopsis 9.1.6

The following factors influence the probability \( \beta \) of a type II error and thus the power \( 1 - \beta \) of a statistical hypothesis test (provided that \(H_1 \neq H_0\)):

a) The smaller the significance level \( \alpha \) (accepted probability of a type I error), the bigger \( \beta \) and thus the smaller the power \( 1 - \beta \) (for a fixed sample size \( n \)).

b) The smaller the difference between \( H_0 \) and \( H_1 \), the bigger \( \beta \) and thus the smaller the power \( 1 - \beta \) (for a fixed sample size \( n \)).

c) The bigger the sample size \( n \), the smaller \( \beta \) and thus the bigger the power \( 1 - \beta \).

d) The smaller the standard deviation \( \sigma \) of the variable, the smaller \( \beta \) and thus the bigger the power \( 1 - \beta \) (for a fixed sample size \( n \)).

9.1.5 The p-value

The measurements of the 11 women provided a mean value of \( 6753.6 \text{ kJ} \). We are now interested in the probability of observing a mean value as small as or smaller than this one in a new random sample of equal size from a population with \( \mu = 7425 \text{ kJ} \) and \( \sigma = 1142 \text{ kJ} \) (thus corresponding to \( H_0 \)). This probability is referred to as the "\( p \)-value" of the observed mean value. Use the applet "p-value" to find the approximate \( p \)-value of the observed mean value of \( 6753.6 \text{ kJ} \).
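Since the sample mean is, under \( H_0 \), approximately normal with mean \( 7425 \text{ kJ} \) and standard error \( 1142/\sqrt{11} \approx 344 \text{ kJ} \), this one-sided \( p \)-value is simply a left-tail probability; a sketch:

```python
import math
from statistics import NormalDist

MU0, SIGMA, N = 7425.0, 1142.0, 11
se = SIGMA / math.sqrt(N)       # standard error of the sample mean

observed_mean = 6753.6
# One-sided p-value: P(sample mean <= observed value | H0)
p = NormalDist(MU0, se).cdf(observed_mean)
print(round(p, 3))              # about 0.026
```

The computed value is smaller than \( \alpha = 0.05 \), so \( H_0 \) can be rejected at the 5% level.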

Figure 9.7. Applet "p-value"

Question Nr. 9.1.6 Which approximate p-value do you find for the difference between the observed sample mean 6754 kJ and the reference value of 7425 kJ?
approx. 0.02
approx. 0.98
approx. 0.04
approx. 0.5

Question Nr. 9.1.7 How must this p-value be interpreted?
Try to write down your interpretation before you start answering the question on-line.
The p-value equals the probability that the null hypothesis is correct.
The p-value equals the probability with which the observed result would have to be expected if the null hypothesis were true.
That the p-value is smaller than 0.05 in our example proves that women have an average daily calorie uptake of less than 7425 kJ before menstruation.
The p-value equals the probability that all women consume less than 7425 kJ per day before their menstruation.
The p-value equals the probability with which one would have to expect a result at least as unlikely as the one observed, under the null hypothesis, in a new random sample of the same size from the same population.

The \( p \)-value can be defined in a more general way as follows:

Definition 9.1.8

First we calculate the value \( t \) of an appropriate test statistic from the data of the observed sample. Then, conditional on the null hypothesis \( H_0 \) being true, we determine the probability that a new random sample of the same size from the same population would provide a value of the test statistic at least as unlikely (under \( H_0 \)) as \( t \) (i.e., arguing against \( H_0 \) at least as strongly as \( t \)).

This phrasing is tuned to small \( p \)-values. For large \( p \)-values we would rather speak of values of the test statistic which argue even less against \( H_0 \) than \( t \). A small \( p \)-value means that the probability of getting a value of the test statistic at least as "extreme" as the observed one, under the null hypothesis, is small. Accordingly, the observed value of the test statistic would then be an "extreme" value itself if the null hypothesis were true. A small \( p \)-value therefore argues against the plausibility of the null hypothesis. The following rule applies:

Synopsis 9.1.7

If the \( p \)-value is smaller than \( \alpha \), then we can reject \( H_0 \) at the significance level \( \alpha \) and decide in favour of the alternative hypothesis.

Question Nr. 9.1.8 The p-value represents the probability under H0 of observing a random sample of the same size from the same population, which argues at least as much in favour of rejecting H0 as the observed result. Which statements are correct?
The p-value equals the probability that H0 is correct.
A high p-value argues in favour of HA
A low p-value argues in favour of HA
The p-value is dependent on the choice of α and β
If HA is correct then we must expect a higher p-value in a small sample than in a large sample.

Important concluding remark: "Statistical significance" must not be confounded with "relevance". In small studies, even relevant differences may fail to translate into statistically significant results, due to a lack of statistical power. On the other hand, irrelevant differences may become statistically significant in large studies, due to excessive statistical power. Therefore, an important goal of study design is to choose the sample size(s) such that the probability of obtaining a statistically significant result is high whenever there is a relevant difference between \( H_1 \) and \( H_0 \).

9.2. Summary

A statistical hypothesis test starts with the formulation of a hypothesis \( H_A \) postulating a certain difference between two populations, a treatment effect or an existing relation between two variables. The null hypothesis, denoted by \( H_0 \), denies the existence of such a difference, treatment effect or relation. As probability calculations are primarily based on the null hypothesis, the hypothesis \( H_A \) is also called "alternative hypothesis". In order to decide whether to reject \(H_0\) (and thus to favour \( H_A \)) or not, we use a test statistic which can be calculated from one or multiple samples. The range of values of the test statistic is divided into two intervals: the rejection and the non-rejection region of \( H_0 \). The decision rule is as follows:

  • If the value of the test statistic lies in the rejection region of \( H_0 \), we reject \( H_0 \) and decide for \( H_A \).
  • If the value of the test statistic lies in the non-rejection region of \( H_0 \), we cannot reject \( H_0 \).
With this decision rule, we can commit two errors:

    1) If \(H_0\) is true and we wrongly reject \( H_0 \), we commit a type I error.
    2) If \(H_A\) is true and we wrongly do not reject \( H_0 \), we commit a type II error.

The accepted probability of a type I error is referred to as the "significance level" of the test and is denoted by \( \alpha \). This value is fixed before carrying out the test. Common values are \( 0.05 \) and \( 0.01 \). With \( \alpha = 0.05 \), one accepts a type I error to occur in \( 5\% \) of cases where \( H_0 \) is true. The probability of a type II error under a specific working hypothesis \( H_1 \) is denoted by \( \beta \). Therefore, the power of a test, i.e. the probability of being able to reject \( H_0 \) if the working hypothesis \( H_1 \) is correct, equals \( 1 - \beta \).

If the significance level is fixed, the power increases and \( \beta \) decreases with increasing sample size. The "\( p \)-value" is the probability of observing a value of the test statistic which would be at least as unlikely as the observed value under the null hypothesis. If we work with the \( p \)-value, the following decision rule applies:

    1) If the \( p \)-value is smaller than \( \alpha \), we reject \( H_0 \) and decide for \( H_A \).
    2) If the \( p \)-value is larger than or equal to \( \alpha \), we do not reject \( H_0 \).

In the first case, we say that the observed result is "statistically significant" (at the level \( \alpha \)). In the second case, the result is said to be "(statistically) non-significant". With a non-significant result, we have to consider the possibility that an observed difference has been caused by chance alone.

"Statistical significance" must not be confounded with "relevance". It must be kept in mind that the \( p \)-value strongly depends on the size of the study. The same observed difference or effect may have a large p-value in a small study and a small p-value in a large study.

Figure 9.8. Scheme of chapter 9 with gaps

References

[1] Altman, D.G. (1994). Practical Statistics for Medical Research. Chapman and Hall, London.