Chapter 11

Comparison of the means of two independent samples

Often, in clinical studies, two different treatments are tested in separate groups of patients. In epidemiological studies, two different sub-populations are often compared with regard to a specific variable. If this is a quantitative variable and we want to find out if its mean value differs significantly between the two observed groups, the following tests can be used:

For normally distributed data the two-sample \(t\)-test is used. Otherwise the nonparametric Wilcoxon rank sum test (also called the Mann-Whitney-U-test) provides an alternative.

Educational objectives

After having worked through this chapter you will be familiar with the tests which are used to evaluate differences in the location of two independent samples with respect to a quantitative variable.

You will also know the conditions for the validity of these tests, and you will be able to decide which tests to apply in concrete situations.

You will be able to classify statistical testing situations based on a given set of criteria.

Key words: Two-sample \(t\)-test, Wilcoxon rank sum test or Mann-Whitney-U-test, independent samples.

Previous knowledge: mean value, median (chap. 3), normal distribution (chap. 6), statistical hypotheses, statistical hypothesis testing, significance level, p-value (chap. 9)

Central questions: Which test should be used to evaluate the difference in the location of two independent samples with respect to a quantitative variable?

Figure 11.1. Scheme of chapter 11

11.0. Introduction

So far, we have examined situations which could be translated into settings with just one sample (situations with one single sample or with two paired samples). In this chapter we will treat situations with two independent samples. We again start with an example:

Several theories concerning the development of schizophrenia imply changes in the activity of the hormone dopamine in the central nervous system. In a study by Sternberg et al. (1982) [1], \(25\) hospitalised schizophrenia patients were treated with an antipsychotic drug and classified as psychotic (\(10\) patients) or non-psychotic (\(15\) patients) after the treatment. Furthermore the dopamine activity in the liquor (cerebro-spinal fluid) was determined for every patient.

Additionally, the ratio between the level of dopamine beta-hydroxylase (DBH) and the protein concentration was determined. For the group of psychotic patients this ratio reached a mean of \(0.0243\), i.e. \(2.43\%\), and for the group of non-psychotic patients a mean of \(0.0164\), i.e. \(1.64\%\). The question arises whether the mean DBH-concentration of psychotic patients (\( = \mu_P \)) differs from that of non-psychotic patients (\(= \mu_N\)). This means that we are interested in the alternative hypothesis \[ H_A : \mu_P \neq \mu_N \,,\] with the corresponding null hypothesis \[ H_0 : \mu_P = \mu_N \,.\] In the following two sections we will introduce procedures for such testing problems.

11.1. Two-sample \(t\)-test

11.1.1 Two-sample \(t\)-test for equal variances

The two-sample \(t\)-test for equal variances treated in this section is also referred to as the "classical two-sample \(t\)-test".

Under the two conditions

    (i) the variable \(X\) is normally distributed in both populations compared,
    (ii) \(X\) has the same standard deviation (or variance) in both populations,

the test statistic of the two-sample \(t\)-test for equal variances follows a \(t\)-distribution with \(n_1 + n_2 - 2\) degrees of freedom, and we can treat our testing problem with this test (the classical \(t\)-test).

In practice, the two conditions can be slightly relaxed by requiring approximate normality and similar variances (or standard deviations).

We look at the Q-Q-plots of the data of the schizophrenia example in figure 11.2 and conclude that the measurements of both the psychotic and the non-psychotic patients are approximately normally distributed.

Figure 11.2. Q-Q-plots of the data in the two patient groups

The sample standard deviations of the measurements of the psychotic and the non-psychotic patients are \( s = 0.00514 \) and \( s = 0.00470 \), respectively. The ratio of the two standard deviations does not deviate much from 1. Therefore we can hypothesize that the standard deviations at the population level are not very different either.

The respective test statistic consists of a standardised difference of the two sample means.

As in the one-sample \(t\)-test, the difference of the mean values is divided by its standard error. The resulting ratio \( t \) has a \(t\)- distribution with \( n_1+n_2-2 \) degrees of freedom under the null hypothesis and the conditions stated above. Here, \(n_1\) and \(n_2\) denote the two sample sizes. The calculation of the standard error, however, is a bit more complicated than in the one-sample \(t\)-test. It is given in section 11.5 for those interested.

We apply the two-sample \(t\)-test to the \(25\) observations and get a \(p\)-value of \( 0.0007 \). Therefore we can reject \( H_0 \) at the significance level \( \alpha = 0.05 \) and decide in favor of the alternative hypothesis that the mean DBH-concentration of the liquor differs between psychotic and non-psychotic patients.
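To illustrate how such a calculation can be carried out with a statistics program, here is a minimal sketch in Python using the scipy library (our own choice of tool, not part of the original study), applied to the summary statistics quoted above. Since it works from rounded summary values rather than the raw measurements, its result is only expected to agree with the reported \(p\)-value up to rounding.

    # Classical two-sample t-test (equal variances) from summary statistics
    # of the schizophrenia example; a sketch, not the original analysis.
    from scipy import stats

    res = stats.ttest_ind_from_stats(
        mean1=0.0243, std1=0.00514, nobs1=10,  # psychotic patients
        mean2=0.0164, std2=0.00470, nobs2=15,  # non-psychotic patients
        equal_var=True,                        # pooled-variance (classical) t-test
    )
    print(res.statistic, res.pvalue)           # p-value should be close to 0.0007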

An important characteristic of the two-sample \(t\)-test is that the normality condition can be ignored in large samples. This follows from the approximate normality of the sampling distribution of the mean, which was discussed in chapter 8.

However, the condition of approximately equal standard deviations (or variances) must be fulfilled even with large samples, unless the two sample sizes are equal or similar. Of course, a natural question to ask is how big a large sample must be. There is no general answer to this question.

However, there are common situations in which concrete answers are possible. For instance, in experimental studies, equally sized groups are common. In this case, a group size of \(n \geq 40 \) is generally sufficient for the classical two-sample \(t\)-test to be valid (cf. [4], pp. 418 ff.).

In a study by Evans et al. (1991) (cf. [2], page 231, and [3]) the influence of diabetes on foot problems was examined. For this study a tendon reflex was examined in \(79\) non-diabetics and \(74\) diabetics who were between \(70\) and \(90\) years old. The measurements were graded with a score (\(1-3\)). The observed mean values and standard deviations are listed in table 11.1.


Table 11.1: Tendon reflexes in diabetics and non-diabetics
Sample          Sample size   Mean value   Standard deviation
diabetics       74            2.1          1.1
non-diabetics   79            1.6          1.2

The question arises whether the mean reflex score of diabetics differs from that of non-diabetics. The standard deviations are similar in both groups and the sample sizes are both larger than \(25\). Thus, we can use the two-sample \(t\)-test to solve this testing problem. It provides a \(p\)-value of \(0.008\). Hence the mean tendon reflex score differs significantly between the two groups at the \(5\%\)-level.
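The same kind of sketch, again in Python with the scipy library and now with the summary statistics of table 11.1 (an illustration only, not the original analysis):

    # Classical two-sample t-test from the summary statistics in table 11.1.
    from scipy import stats

    res = stats.ttest_ind_from_stats(
        mean1=2.1, std1=1.1, nobs1=74,  # diabetics
        mean2=1.6, std2=1.2, nobs2=79,  # non-diabetics
        equal_var=True,
    )
    print(round(res.pvalue, 3))         # should be close to the reported 0.008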

Synopsis 11.1.1

In the case of two independent samples of quantitative data, the two-sample \(t\)-test can be used to examine the difference in the two sample means if

1) the data are close to normally distributed in both populations and the variances are similar.

2) the sample sizes or the variances are similar and \( n_1+n_2 \geq 40 \).

In case of doubt, the \(t\)-test for unequal variances, which is treated in the next section, should be used. This applies in particular when it is doubtful whether the variances can be considered sufficiently similar.

11.1.2. The two-sample \(t\)-test for unequal variances

If the sample sizes are similar, the classical two-sample \(t\)-test is robust against violations of the "equal variance"-condition. However, if both the sample sizes and variances are different, the classical two-sample \(t\)-test becomes invalid.

As an alternative, the two-sample \(t\)-test for unequal variances can be used. It is also known as the "Welch test".

In this test, the standard error of the difference of the sample means has a different form (cf. section 11.5).

Unfortunately, the statistic of this test does not follow an exact \(t\)-distribution under the null hypothesis of equal population means. However, if the sample sizes are not too small, the null-distribution of this test-statistic can be well approximated by a \(t\)-distribution with an "adjusted" number of degrees of freedom.

The formula for the computation of this adjusted number is quite complicated. Fortunately, the test is implemented in most statistics programs. Moreover, using the smaller of the two numbers \(n_1-1\) and \(n_2-1\) as the degrees of freedom, one obtains a conservative test, i.e., a test with a type I error probability \(\lt \alpha\).
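As a minimal sketch of this conservative rule in Python (using the scipy library; the numbers are the summary statistics of table 11.1 and serve only as an illustration), one compares the observed \(t\)-ratio with a \(t\)-distribution with \(\min(n_1-1,\, n_2-1)\) degrees of freedom:

    # Conservative variant of the t-test for unequal variances:
    # use min(n1 - 1, n2 - 1) degrees of freedom instead of Welch's adjusted value.
    from math import sqrt
    from scipy import stats

    m1, s1, n1 = 2.1, 1.1, 74   # diabetics (table 11.1)
    m2, s2, n2 = 1.6, 1.2, 79   # non-diabetics

    se = sqrt(s1**2 / n1 + s2**2 / n2)           # standard error without pooling
    t = (m1 - m2) / se
    df_conservative = min(n1 - 1, n2 - 1)

    p = 2 * stats.t.sf(abs(t), df_conservative)  # two-sided p-value
    print(t, df_conservative, p)                 # p is slightly larger than Welch's p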

The authors of [4] recommend using this test rather than the classical two-sample \(t\)-test in case of doubt. They give the following recommendations (cf. [4], pp. 418 ff.) for the use of this test.

a) If \(n_1+n_2 \leq 15\), then the variable \(X\) should look close to normally distributed in both samples.

b) If \(n_1 + n_2 \geq 40\), the test can even be used if \(X\) has a clearly skewed distribution.

If the \(t\)-test for unequal variances is applied to the two examples of section 11.1.1, the \(p\)-values turn out to be \(0.0011\) and \(0.008\), respectively. Thus, the \(p\)-value slightly increases in the first example, but remains the same in the second one. In both examples, the null-hypothesis of equal population means can again be rejected at a two-sided significance level of \(\alpha = 0.05\).
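A sketch of these two computations in Python with the scipy library, again working from the reported summary statistics, so the results are only expected to match the quoted \(p\)-values up to rounding:

    # Two-sample t-test for unequal variances (Welch test) for both examples,
    # computed from the summary statistics given in the text; a sketch only.
    from scipy import stats

    # Schizophrenia example (section 11.1.1)
    welch1 = stats.ttest_ind_from_stats(
        mean1=0.0243, std1=0.00514, nobs1=10,
        mean2=0.0164, std2=0.00470, nobs2=15,
        equal_var=False,                  # Welch: variances are not pooled
    )

    # Diabetes example (table 11.1)
    welch2 = stats.ttest_ind_from_stats(
        mean1=2.1, std1=1.1, nobs1=74,
        mean2=1.6, std2=1.2, nobs2=79,
        equal_var=False,
    )

    print(welch1.pvalue, welch2.pvalue)   # should be close to 0.0011 and 0.008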

11.2 The Wilcoxon rank sum test (or Mann-Whitney-U test)

In a study by Anionwu et al. (1981) (cf. [5]) it was examined whether the type of disease has an influence on the haemoglobin concentration in the blood of patients with sickle cell disease. The haemoglobin concentration was measured (in \(g/dl\)) in \(16\) patients with type HBSS and \(10\) patients with type HBST. The respective measurements are illustrated as boxplots in figure 11.3.

Figure 11.3. Boxplots of the haemoglobin measurements [\(g/dl\)] in the two samples

We further take a look at the Q-Q plot of the measured values of both groups in figure 11.4.

Figure 11.4. Q-Q-plots of the haemoglobin measurements [g/dl] in the two samples

After examining the graphs we doubt that the data are normally distributed in the underlying populations. Furthermore, the sample sizes are small. Therefore the two-sample \(t\)-test is not a valid option for the analysis of these data.

On the other hand, we assume that the data of the two groups are identically distributed except for their location, even though the illustration in figure 11.3 does not fully support this. (Below we briefly discuss the consequences of a violation of this condition.)

The sample standard deviations are \(0.85\) in the HBSS-group and \(1.29\) in the HBST-group. Since the ratio of the two standard deviations does not deviate too much from 1, we can hypothesize that the population standard deviations are not too different either. Under the condition that the distributions of haemoglobin are identical in both populations up to a shift (i.e., that they are congruent), we can apply the so-called "Wilcoxon rank sum test" to test the null hypothesis \[ H_0 : \mu_{HBSS} = \mu_{HBST} \] against the two-sided alternative hypothesis \[ H_A : \mu_{HBSS} \neq \mu_{HBST} \] where \( \mu_{HBSS} \) denotes the mean haemoglobin concentration in the HBSS-population and \( \mu_{HBST} \) the mean haemoglobin concentration in the HBST-population.

As with the Wilcoxon signed rank test, the test statistic \( W \) of the Wilcoxon rank sum test is based on the ranking of the values and not on the measured values themselves. While the two-sample \(t\)-test is based on the difference in the sample means of the original data, the Wilcoxon rank sum test uses the difference in the means of the corresponding ranks. The rank variables of the haemoglobin data are listed in table 11.2.


Table 11.2: Haemoglobin values [g/dl] and their rankings

HBSS-group            HBST-group
Value    Rank         Value    Rank
7.2      1
7.7      2
8.0      3
                      8.05     4
8.10     5
8.30     6
8.39     7
8.40     8
8.50     9
8.60     10
8.70     11
9.09     12
9.10     13
9.11     14
                      9.20     15
9.80     16
                      10.00    17
10.1     18
10.3     19
                      10.4     20
                      10.6     21
                      10.9     22
                      11.1     23
                      11.9     24
                      12.0     25
                      12.1     26

The mean values of the ranks are \(9.625\) (HBSS-group) and \(19.7\) (HBST-group). They clearly differ from one another.

One can calculate exact \(p\)-values for small sample sizes, but for larger samples, statistics programs calculate the \(p\)-values using an approximation formula.

For our testing problem, the exact \(p\)-value is \(0.0006\). Based on this test result, we reject the null hypothesis and decide in favor of the alternative hypothesis stating that the mean haemoglobin concentration of patients with the disease type HBSS differs from the one of patients with disease type HBST.
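As a sketch, the same analysis can be reproduced in Python with the scipy library, which implements the test in its Mann-Whitney-U form; with the haemoglobin values of table 11.2 (no ties) a sufficiently recent version of scipy computes the exact \(p\)-value for samples of this size:

    # Wilcoxon rank sum test (Mann-Whitney-U test) for the haemoglobin data
    # of table 11.2; a sketch using scipy, not the original analysis.
    import numpy as np
    from scipy import stats

    hbss = [7.2, 7.7, 8.0, 8.10, 8.30, 8.39, 8.40, 8.50,
            8.60, 8.70, 9.09, 9.10, 9.11, 9.80, 10.1, 10.3]
    hbst = [8.05, 9.20, 10.00, 10.4, 10.6, 10.9, 11.1, 11.9, 12.0, 12.1]

    # Mean ranks of the two groups in the combined sample
    ranks = stats.rankdata(np.concatenate([hbss, hbst]))
    print(ranks[:len(hbss)].mean(), ranks[len(hbss):].mean())  # 9.625 and 19.7

    # Exact two-sided test
    res = stats.mannwhitneyu(hbss, hbst, alternative="two-sided", method="exact")
    print(res.pvalue)                                          # close to 0.0006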

If, up to a shift in location, the data have the same distribution in both populations (i.e., if the distributions are congruent), the difference in the mean values is equal to the difference in the medians at the population level. In this case, the Wilcoxon rank sum test will simultaneously test equality of the medians against their difference.

If the two distributions have similar shapes but different spreads, then the Wilcoxon rank sum test captures differences in the medians rather than differences in the means. In general, therefore, the Wilcoxon rank sum test should be regarded as a test for differences in the medians rather than in the means.

The Wilcoxon rank sum test generally captures differences between two distributions well if their cumulative distribution functions do not intersect each other.

Question Nr. 11.2.1 Which of the samples illustrated in figure 11.5 can be compared with the Wilcoxon rank sum test regarding their median?
Sample 1 and sample 2
Sample 1 and sample 3
Sample 1 and sample 4
Sample 2 and sample 3
Sample 3 and sample 5

Note: Consider the scaling of the x-axis in each case.

Figure 11.5. Histograms of quantitative variables in six different samples

Synopsis 11.2.1

Assume that a quantitative variable \( X \) has been observed in two independent random samples from different populations.

    1) If, up to a shift in location, the distributions of \( X \) look almost the same in the two samples, the Wilcoxon rank sum test can be used to test the difference between the means of \( X \) in the two samples.
    2) If the two distributions have a similar shape but a different spread, then the result of the Wilcoxon rank sum test rather relates to the difference in the medians.

Recall that empirical distributions can be compared very well using boxplots. Case 1) corresponds to a situation where the two boxplots are essentially congruent, while case 2) corresponds to a situation where the boxplots have a similar shape but a different length.

Maths enthusiasts can gain an insight into how this test works with the applet "Wilcoxon rank sum test".

Figure 11.6. Applet "Wilcoxon rank sum test"

11.3 Overview of the testing situations

Figure 11.7 gives an overview of the different testing situations and the criteria which have to be considered when choosing a statistical test. Some of these situations were treated in the present chapter and in chapter 10. Others will be treated in chapter 12.

Figure 11.7. Classification of testing situations

The primary criterion for categorizing testing situations is the type of data compared:

    nominal data (e.g. gender and blood group): respective testing procedures will be introduced in chapter 12.
    quantitative data: the simplest testing procedures for quantitative data have been treated in chapter 10 and the present chapter.

In a second step the testing situations are distinguished based on the number, dependency and size of the samples to be compared. These different situations call for different test procedures. We differentiate between

    Testing situations with one sample (as was the case in the example of the calorie consumption of women prior to their menstruation). In this case a sample is compared to a reference value of a population.

    Testing situations with two independent samples (e.g. the calorie consumption of a group of women and of a group of men). In this case the two samples are compared with respect to the mean of a quantitative variable or the relative frequency of a specific characteristic.

    Testing situations with more than two independent samples (e.g. the calorie consumption in various age classes of women). In this case the mean values or the relative frequencies of multiple samples are compared. Mean values of \(3\) or more samples can be compared using the method of "analysis of variance", which will not be treated in this course. A good introduction to the analysis of variance is given in [6]. The comparison of relative frequencies of multiple independent samples will be introduced in chapter 12.

    Testing situations with two dependent samples (e.g. the calorie consumption of a group of women prior to and after their menstruation, where the first sample consists of the values measured prior to the menstruation and the second sample of the values measured after the menstruation).

The following flow chart provides an overview of the testing situations with quantitative variables observed in one sample, two paired samples or two independent samples:

Figure 11.8. Flowchart for test selection

Question Nr. 11.3.1.1 Are the samples involved in the following statistical comparison independent (I) or dependent (D):
Comparison of the haemoglobin level of smokers and non-smokers.
I
D

Question Nr. 11.3.1.2 Are the samples involved in the following statistical comparison independent (I) or dependent (D):
Comparison of the dexterity of the right and left hand in a sample of students.
D
I

Question Nr. 11.3.1.3 Are the samples involved in the following statistical comparison independent (I) or dependent (D):
Comparison of the dexterity of right- and left-handers in a sample of students
I
D

Question Nr. 11.3.1.4 Are the samples involved in the following statistical comparison independent (I) or dependent (D):
Comparison of the power of concentration in the evening and in the morning in a sample of students.
D
I

Question Nr. 11.3.2.1 Match the variable underlying the following comparison to the classification criteria `qualitative data (= A)' and `quantitative data (= B)':
Comparison of the haemoglobin level (g/dl) of smokers and non-smokers.
A
B

Question Nr. 11.3.2.2 Match the variable underlying the following comparison to the classification criteria `qualitative data (= A)' and `quantitative data (= B)':
Comparison of two lie detectors (result: true, false).
A
B

Question Nr. 11.3.2.3 Match the variable underlying the following comparison to the classification criteria `qualitative data (= A)' and `quantitative data (= B)':
Comparison of the frequency of asthma in Davos and in Basel based on the data of representative surveys.
A
B

Question Nr. 11.3.2.4 Match the variable underlying the following comparison to the classification criteria `qualitative data (= A)' and `quantitative data (= B)':
Comparison of body heights of conscripts from different countries.
A
B

Synopsis 11.3.1

A testing situation can be characterised based on the following criteria:

a) the data type

b) the number of samples

c) the dependency structure of the sample(s)

d) the sample size(s).

11.4 Summary

We have the following three options to test whether the difference in the means of a variable \( X \) from two independent random samples is statistically significant:

    The two-sample \(t\)-test for equal variances can be applied in most situations, if
      a) the values of \(X\) look close to normally distributed in both samples or the sample sizes are large,
      and
      b) the two standard deviations or the two sample sizes are similar.

    The two-sample \(t\)-test for unequal variances can be applied in most situations, as long as the sum of the two sample sizes exceeds 40.

    Alternatively, the Wilcoxon rank sum test can be used, if, up to a shift in location, the distributions of \(X\) look almost the same in the two samples.

    If the distribution of \(X\) has a similar shape but a different spread in the two samples, the Wilcoxon rank sum test captures differences between the medians rather than between the means.

    In general, a testing situation can be classified according to the criteria "data type", "number of samples", "sample size" and "dependency structure of the samples".

Figure 11.9. Scheme of chapter 11 with gaps

11.5. Mathematical appendix

Assume that a variable \( X \) is normally distributed in two populations with mean values \( \mu_1 \) and \( \mu_2 \) and standard deviations \( \sigma_1 \) and \( \sigma_2 \), and that we have two independent random samples of sizes \( n_1 \) and \( n_2 \) from these populations. Then the standard error of the difference \( \bar{x}_2 - \bar{x}_1 \) equals \( \sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}} \). It can be estimated by \[ SE_1 = \sqrt{ \frac{s_1^2}{n_1}+\frac{s_2^2}{n_2} } \,, \] where \( s_1 \) and \( s_2 \) are the standard deviations of \( X \) in the two samples.

This is the estimate of the standard error of \( \bar{x}_2 - \bar{x}_1 \) in the two-sample \(t\)-test for unequal variances.

However, if \( \sigma_1 = \sigma_2 \), a better estimate is \[ SE_2 = \sqrt{\frac{1}{n_1}+\frac{1}{n_2}} \times \sqrt{ \frac{ (n_1-1)\times s_1^2 + (n_2-1)\times s_2^2}{n_1 + n_2 - 2}} \,.\] This is the estimate of the standard error of \( \bar{x}_2 - \bar{x}_1 \) in the two-sample \(t\)-test for equal variances.

Under the conditions that

• \(X\) is normally distributed in both populations,
• \( \sigma_1 = \sigma_2 \), and
• \( \mu_1 = \mu_2 \),

the ratio \[ t = \frac{\bar{x}_2 - \bar{x}_1}{SE_2} \] has a \(t\)-distribution with \( n_1 + n_2 - 2 \) degrees of freedom.

For large sample sizes, the ratio \[ t = \frac{\bar{x}_2 - \bar{x}_1}{SE_1} \] approximately follows a standard normal distribution under the null hypothesis that \( \mu_2 = \mu_1 \), even if \( \sigma_2 \) and \( \sigma_1 \) or \( n_1 \) and \( n_2 \) are clearly different, provided that the distribution of \( X \) is not too skewed. Thus, in this case the critical limits for the observed \(t\)-statistic can be set to \(-1.96\) and \(1.96\).
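The two standard-error estimates can be transcribed directly into code. The following Python sketch (the function name and the use of the scipy library are our own choices for illustration) applies both estimates to the summary statistics of the diabetes example in table 11.1, using the conservative degrees of freedom of section 11.1.2 in the unequal-variance case:

    # Direct transcription of SE_1 (unequal variances) and SE_2 (equal variances).
    from math import sqrt
    from scipy import stats

    def two_sample_t(m1, s1, n1, m2, s2, n2, equal_var=True):
        """Return t-statistic, degrees of freedom and two-sided p-value
        computed from summary statistics of two independent samples."""
        if equal_var:
            # SE_2: pooled estimate, valid if sigma_1 = sigma_2
            se = sqrt(1/n1 + 1/n2) * sqrt(((n1-1)*s1**2 + (n2-1)*s2**2) / (n1+n2-2))
            df = n1 + n2 - 2
        else:
            # SE_1: no pooling; conservative df min(n1-1, n2-1), cf. section 11.1.2
            se = sqrt(s1**2/n1 + s2**2/n2)
            df = min(n1 - 1, n2 - 1)
        t = (m2 - m1) / se                   # x_bar_2 - x_bar_1, as in the text
        p = 2 * stats.t.sf(abs(t), df)
        return t, df, p

    # Diabetes example (table 11.1): non-diabetics as sample 1, diabetics as sample 2
    print(two_sample_t(1.6, 1.2, 79, 2.1, 1.1, 74, equal_var=True))
    print(two_sample_t(1.6, 1.2, 79, 2.1, 1.1, 74, equal_var=False))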

References

[1] D.E. Sternberg, D.P. VanKammen, P. Lerner, W.E. Bunney (1982)
Schizophrenia: dopamine beta-hydroxylase activity and treatment response.
Science 216 (6), pp. 1423-5, DOI: 10.1126/science.6124036

Data copied from page 37 of

D.J. Hand, F. Daly, A.D. Lunn, K.J. McConway and E. Ostrowski
A Handbook of Small Data Sets
Chapman and Hall, London, 1994

[2] W.W. Daniel (1995)
Biostatistics
John Wiley and Sons Inc.

[3] S.L. Evans, B.P. Nixon, I. Lee, D. Yee, A.D. Mooradian (1991)
The prevalence and nature of podiatric problems in elderly diabetic patients
J Am Geriatr Soc 39 (3), pp. 241-5, DOI: 10.1111/j.1532-5415.1991.tb01644.x

[4] D.S. Moore, G.P. McCabe (1993)
Introduction to the Practice of Statistics
W.H. Freeman and Company, New York

[5] E. Anionwu, D. Watford, M. Brozovic, B. Kirkwood (1981)
Sickle cell disease in a British urban community
British Medical Journal (Clin Res Ed) 282, pp. 283-6, DOI: 10.1136/bmj.282.6260.283

Data copied from page 247 of

D.J. Hand, F. Daly, A.D. Lunn, K.J. McConway and E. Ostrowski
A Handbook of Small Data Sets
Chapman and Hall, London, 1994

[6] D.G. Altman (1994)
Practical Statistics for Medical Research
Chapman and Hall, London.