Section 2: Scientific Principles
Part D: Physiology and Anesthesia
Chapter 21: Research Design and Statistics in Anesthesia

One- and Two-Sample t-Tests

The most widely used, and misused, statistical test is Student's t-test, a group of statistical tests designed for analysis of a single group or comparison of two groups. The name of this test refers to a pseudonym used by W.L. Gosset. Gosset's employer, the Guinness Brewing Company, did not permit its employees to publish their research under their own names; however, because of the importance of this work, Gosset was permitted to publish under the name Student. [43] Gosset's contribution was to develop the t distribution. The Z distribution (Z scores) described earlier is based on an infinite sample size. Gosset recognized that with smaller sample sizes (e.g., fewer than 30), the distributions differed from the exact bell shape of the Gaussian distribution. In particular, the presence of an extreme value (i.e., one far from the mean) was more likely to occur in a small sample. Gosset examined a variety of distributions of small samples and determined the frequency with which the more extreme values occurred.

Earlier we observed that in an infinitely large, normally distributed sample, 95 percent of all observations fell within 1.96 SD of the mean. Gosset found that with a sample size of 20, a slightly larger range, 2.09 SD, was necessary to include the same 95 percent of the observations. As the sample size decreased to 10, 2.26 SD were necessary; with a sample size of 5, an even larger range, 2.78 SD, was necessary to include 95 percent of all observations. These values (e.g., t = 2.09 for α = 0.05 and N = 20) make up the t distribution. Every value in the t distribution is associated with a sample size and a value for α. As the sample size becomes large (e.g., >30), the values for the t distribution approach those for the Z distribution; with an infinite sample size, they are identical. Gosset's observations were then used to define three statistical tests: the one-sample, the two-sample, and the paired-sample t-tests.
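The way the t values shrink toward 1.96 as the sample grows can be checked numerically. The following is a minimal sketch, not part of the original chapter: it uses only the Python standard library, integrates the t density by Simpson's rule, and locates the two-tailed critical value by bisection. The function names (`t_pdf`, `central_area`, `t_critical`) are illustrative choices, and df is taken as one less than the sample size.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def central_area(x, df, steps=1000):
    """Probability that t lies in [-x, x], using Simpson's rule on [0, x]."""
    h = x / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2 * total * h / 3  # double the half-range integral (symmetry)

def t_critical(df, alpha=0.05):
    """Two-tailed critical value: the x for which central_area = 1 - alpha."""
    lo, hi = 0.0, 100.0
    for _ in range(50):  # bisection on a monotonically increasing function
        mid = (lo + hi) / 2
        if central_area(mid, df) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Sample sizes quoted in the text, plus two larger ones for comparison.
for n in (5, 10, 20, 31, 1001):
    print(f"N = {n:4d}  df = {n - 1:4d}  t = {t_critical(n - 1):.2f}")
```

For sample sizes of 5, 10, and 20 the sketch gives values that round to the 2.78, 2.26, and 2.09 quoted above, and for very large samples the result approaches 1.96.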

Parametric t-Tests

One-Sample t-Tests

If an investigator measures a variable in a single group of subjects, he or she may be interested in determining whether the mean for this sample differs from zero (or alternatively, from some other specific value). This analysis can be performed using the one-sample t-test. To examine the association between diuretic drugs and acid-base status, we might identify a group of patients taking diuretic drugs, obtain arterial blood samples, and determine base excess. A hypothetic set of values is shown in Table 21–7. The mean of these values is 3.0, a fact that might suggest an association between diuretic drugs and alkalosis. However, two of the subjects have negative values and two have zero values. Because a quick and informal appraisal of the data (the “eye” test) cannot determine whether 3.0 differs significantly from zero, we must use statistical analysis, in this case the one-sample t-test. If we assume that this sample represents the population and that the distribution of these values is normal, we can state with 95 percent confidence that the population mean lies between [mean – (t × SE)] and [mean + (t × SE)]. This statement is similar to an earlier one that the population mean lies between [mean – (1.96 × SE)] and [mean + (1.96 × SE)]. However, the value 1.96 (the Z value to include 95 percent of the population) must now be replaced by the value of t appropriate for the sample size, that is, 2.20. This value can be found in a table of the t distribution by locating the row corresponding to the appropriate df (in this case, the value for df is one less than the sample size, or 11) and the desired value of α (typically, 0.05). Sample values for the t distribution are shown in Table 21–3; a complete listing can be found in any statistical textbook. In this case, the 95 percent confidence limits for the mean are 0.47 and 5.53; these limits do not include the value zero. To determine the 99 percent confidence limits, we use the t value for α = 0.01 and 11 df, 3.11. This results in 99 percent confidence limits of –0.57 and 6.57, a range that does include zero. We can conclude with 95 percent, but not 99 percent, confidence that the use of diuretic drugs is associated with alkalosis. Note that our conclusion is limited to making an association between diuretic drugs and alkalosis rather than implying causality. Statistics does not prove causality, only the likelihood of an association.
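As a worked check of these confidence limits, the sketch below applies mean ± (t × SE) with the critical values given in the text. The SE of 1.15 is an assumption: it is not stated in the text and was backed out from the quoted 95 percent limits. The function name `confidence_limits` is illustrative.

```python
def confidence_limits(mean, se, t_crit):
    """Return (lower, upper) limits: mean -/+ t_crit * SE."""
    half_width = t_crit * se
    return mean - half_width, mean + half_width

mean = 3.0      # mean base excess, from the text
se = 1.15       # ASSUMPTION: inferred from the quoted limits, not given in the text
t_95 = 2.20     # critical t for alpha = 0.05, 11 df (from the text)
t_99 = 3.11     # critical t for alpha = 0.01, 11 df (from the text)

ci95 = confidence_limits(mean, se, t_95)
ci99 = confidence_limits(mean, se, t_99)

print("95% limits:", [round(x, 2) for x in ci95], "excludes zero:", ci95[0] > 0)
print("99% limits:", [round(x, 2) for x in ci99], "excludes zero:", ci99[0] > 0)
```

Because the assumed SE is rounded, the 99 percent limits may differ from the quoted values in the last digit; the qualitative result (zero lies outside the 95 percent interval but inside the 99 percent interval) is unchanged.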

TABLE 21–7. Hypothetic Set of Values for Base Excess for Subjects Given Diuretic Drugs

TABLE 21–3. Critical Values of the t Distribution (Two-Tailed)

An alternate approach to this problem, more familiar to some readers, is shown in Table 21–8. The division of the mean value by the SE produces a value for t. This value is compared with a value for t appropriate for the desired level of significance (usually, α = 0.05 or 5%) for the appropriate degrees of freedom. If the value for t exceeds the value from the table (known as the critical value), the null hypothesis is rejected, and we would conclude that a difference exists (at the α level) between zero and the mean value for this population. In this instance, the value for t is 2.6. The critical value for t for α = 0.05 and 11 df is 2.20; the critical value for α = 0.01 and 11 df is 3.11. The value for t exceeds the critical value for α of 0.05, but not for α of 0.01. This fact also leads us to conclude (as we did earlier) that the mean for this population differs from zero at the 0.05 level but not at the 0.01 level (0.01 < P < 0.05).

TABLE 21–8. One-Sample Student's t-Test Applied to Hypothetic Data from Table 21–7

The one-sample t-test can also be applied to populations for which mean values are expected to be other than zero. For example, if we were interested in whether 90 mm Hg was the mean PaO2 of smokers, we would determine the difference between 90 mm Hg and the mean value for the sample population, divide this value by the SE of the sample to determine the t statistic, and then compare the resulting value with the critical value for t.
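The calculation described above (difference from the reference value, divided by the SE, compared with a critical value) can be sketched in a few lines. This is an illustration only: the twelve base excess values are invented to resemble the description in the text (mean 3.0, two negative values, two zeros) and are not the values of Table 21–7. The function name `one_sample_t` and its `mu0` parameter are hypothetical.

```python
import math
import statistics

def one_sample_t(values, mu0=0.0):
    """Return (mean, SE, t) for testing whether the mean differs from mu0."""
    n = len(values)
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(n)  # SE = SD / sqrt(n)
    return mean, se, (mean - mu0) / se

# INVENTED data for illustration; not the values of Table 21-7.
base_excess = [-3, -1, 0, 0, 1, 2, 3, 4, 5, 7, 8, 10]

mean, se, t = one_sample_t(base_excess)   # null hypothesis: mean = 0
T_CRIT_05 = 2.20  # critical t, alpha = 0.05, 11 df (from the text)
T_CRIT_01 = 3.11  # critical t, alpha = 0.01, 11 df (from the text)

print(f"mean = {mean:.2f}, SE = {se:.2f}, t = {t:.2f}")
print("reject at 0.05:", abs(t) > T_CRIT_05)
print("reject at 0.01:", abs(t) > T_CRIT_01)
```

To test against a reference value other than zero, such as a mean PaO2 of 90 mm Hg, the same function would be called with `mu0=90`.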

Two-Sample t-Tests

More commonly, we make measurements on two groups of subjects and compare the responses. Such a comparison requires use of the two-sample t-test. In this form of t-testing, two independent samples are being compared; that is, an individual datum in one sample is not associated with any particular datum in the second sample. For example, we might want to compare blood pressure in normovolemic and hypovolemic individuals. Hypothetic values are provided in Table 21–9. In this instance, an informal appraisal of the data would suggest that a difference exists between the mean values. This impression can be confirmed by using the two-sample t-test to evaluate the null hypothesis that there is no difference between these two samples; that is, they come from the same population.

TABLE 21–9. Application of the One-Tailed Two-Sample Student's t-Test to Hypothetic Sets of Mean Blood Pressure Values for Normovolemic and Hypovolemic Subjects

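Computation of the t statistic for the two-sample test is slightly more complicated than for the one-sample test. In the one-sample test, we divided the mean by its SE. With two samples, each sample has its own SE; we then determine the SE of the difference between the means. Although the derivation of the SE of the difference between the means is beyond the scope of this chapter, [44] the denominator of the equation is similar to that for the one-sample test:

$$SE_{\bar{x}_1 - \bar{x}_2} = \sqrt{SE_1^2 + SE_2^2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \qquad (15)$$

where $s_1^2$ and $s_2^2$ are the variances and $n_1$ and $n_2$ the sizes of the two samples.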

Alternate methods are available to calculate the SE of the difference between means, one of which uses a pooled variance instead of the separate variances of equation 15. [44] Despite the different methods of calculation, the resulting SE values are similar.

Next, the difference between groups is divided by this SE; this process produces the now familiar t statistic, which is then compared with the critical value from the tables. If the value for t exceeds the critical value, the null hypothesis is rejected; if t is less than the critical value, the null hypothesis cannot be rejected. An intuitive way to view the comparison between the calculated value for t and the critical value is as follows: the numerator is the difference between the mean values, an estimate of the distance between the central locations of the two samples. The denominator, shown in equation 15, is the SE of the difference between means, an estimate of the variability within the samples. The ratio of these values estimates how much of the difference between means can be explained by the variability within the samples. If the variability within each of the samples is small, only a small difference between mean values should be sufficient to suggest that a difference exists between the samples. In contrast, if the variability within one or both of the samples is great, the difference between the means of the samples must be larger if the investigator is to have confidence that a difference exists between groups.
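The ratio described above can be sketched as follows, assuming the separate-variance form of the SE of the difference between means, sqrt(s1²/n1 + s2²/n2). The blood pressure values are invented for illustration (eight subjects per group, so that n1 + n2 – 2 = 14) and are not the values of Table 21–9; the function name `two_sample_t` is hypothetical.

```python
import math
import statistics

def two_sample_t(sample1, sample2):
    """Return (difference between means, SE of the difference, t)."""
    n1, n2 = len(sample1), len(sample2)
    diff = statistics.mean(sample1) - statistics.mean(sample2)
    # Separate-variance SE of the difference between means (ASSUMED form).
    se_diff = math.sqrt(statistics.variance(sample1) / n1
                        + statistics.variance(sample2) / n2)
    return diff, se_diff, diff / se_diff

# INVENTED mean blood pressures (mm Hg); not the values of Table 21-9.
normovolemic = [88, 92, 95, 90, 85, 97, 93, 91]
hypovolemic = [70, 75, 68, 80, 72, 66, 78, 74]

diff, se_diff, t = two_sample_t(normovolemic, hypovolemic)
T_CRIT_ONE_TAILED = 1.76  # alpha = 0.05, 14 df, one-tailed (from the text)

print(f"difference = {diff:.2f}, SE = {se_diff:.2f}, t = {t:.2f}")
print("reject null hypothesis:", t > T_CRIT_ONE_TAILED)
```

With small within-group variability, as in these invented data, even a modest difference between means yields a large t; widening the spread of either group enlarges the SE and shrinks t.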

Before using a table of t values to determine the critical value for this example, we should consider one special aspect. In our hypothetic situation, we have been comparing blood pressure in normovolemic and hypovolemic individuals. Our a priori assumption is that blood pressure will be lower, rather than higher, in the hypovolemic group. Therefore, rather than perform a two-tailed comparison that would permit us to assess all possible relationships among the data, it would be appropriate to perform a one-tailed test. Because the critical value for t (α = 0.05, df = 14) is lower for a one-tailed comparison (1.76) than for a two-tailed comparison (2.14), we have increased our chances of detecting a statistically significant difference by using the former.

In this example, the t statistic is markedly greater than the critical value; we conclude that it is unlikely (P < .05) that these two samples were selected from the same population. Therefore, the mean value for blood pressure is lower for these hypovolemic individuals than for normovolemic individuals. Because the t statistic is markedly greater than the critical value for α = 0.05, we can refer to the table to determine the critical values for higher levels of significance. With 14 df, the one-tailed t is 2.62 when α = 0.01 and 2.98 when α = 0.005. Because the t statistic exceeds both these values, we can conclude that the likelihood of these samples being from the same population is extremely small, less than 0.005 or 1 in 200.

Paired-Sample t-Tests

On occasion, an investigator obtains measurements before and after an intervention and then studies whether this intervention produced a significant effect. Under these circumstances, when the two samples being compared are paired, a paired-sample t-test, more commonly called a paired t-test, is used. For example, measuring cardiac output before and after the administration of pancuronium might produce the values shown in Table 21–10. In this instance, an informal appraisal of the data suggests a strong difference between the “before” and “after” values, because five of six subjects had an increase in cardiac output. The paired t-test is used to confirm this observation. A new sample is created whose members are equal to the difference between the “before” and “after” values for each subject. This new sample is then analyzed by the one-sample t-test. The mean value for this sample is 0.80 L/min, and the SE is 0.30 L/min. Therefore, the t statistic is 2.69, a value that exceeds the critical value of 2.01 (again, we can use a one-tailed test because we assume that pancuronium will increase, not decrease, cardiac output). We would conclude that the “before” and “after” measurements are unlikely to be from the same population. This conclusion suggests that pancuronium increases cardiac output.

TABLE 21–10. Application of the Paired t-Test to Hypothetic Values for Cardiac Output Before and After Administration of Pancuronium

An alternate statistical approach to this hypothetic situation would be to use the two-sample t-test (Table 21–11). This test produces a value for t of 1.69, which is less than the critical value of 1.81. Despite the greater degrees of freedom for the two-sample test (10) than for the paired test (5), the unpaired test does not support our belief that a difference exists between the “before” and “after” values. This lack of confirmation results from the variability among subjects in the “before” and “after” values. The two-sample test assumes that the investigator obtained the “before” measurements in one group of subjects and the “after” measurements in another group. The paired test is more sensitive to small changes because it assumes that the “before” and “after” values were measured in the same subject. For example, a change in cardiac output from 4 to 6 L/min in one subject means something entirely different than would the measurement of 4 L/min in one subject before pancuronium and the measurement of 6 L/min in a different subject after pancuronium.
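The contrast between the paired and the unpaired analysis can be reproduced with a short sketch. The cardiac output values below are invented for illustration (six subjects, five of whom show an increase) and are not the values of Table 21–10; the function names are hypothetical, and the unpaired SE assumes the separate-variance form.

```python
import math
import statistics

def paired_t(before, after):
    """One-sample t-test on the per-subject differences (after - before)."""
    diffs = [a - b for b, a in zip(before, after)]
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / se

def unpaired_t(sample1, sample2):
    """Two-sample t statistic using the separate-variance SE (ASSUMED form)."""
    se = math.sqrt(statistics.variance(sample1) / len(sample1)
                   + statistics.variance(sample2) / len(sample2))
    return (statistics.mean(sample2) - statistics.mean(sample1)) / se

# INVENTED cardiac outputs (L/min); not the values of Table 21-10.
before = [3.5, 4.0, 4.8, 5.5, 6.2, 7.0]
after = [4.6, 5.1, 5.5, 5.3, 7.4, 8.1]

t_paired = paired_t(before, after)      # 5 df, one-tailed critical value 2.01
t_unpaired = unpaired_t(before, after)  # 10 df, one-tailed critical value 1.81

print(f"paired t   = {t_paired:.2f}  significant: {t_paired > 2.01}")
print(f"unpaired t = {t_unpaired:.2f}  significant: {t_unpaired > 1.81}")
```

In these data the subjects differ widely from one another, so the unpaired SE is dominated by between-subject variability, whereas the paired analysis removes it by working only with each subject's change.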

TABLE 21–11. Inappropriate Application of the Two-Sample t-Test to the Hypothetic Analysis of Data of Table 21–10

Nonparametric t-Tests: the Mann-Whitney U-Test

Data on an ordinal scale require special treatment, because determining means and variances for this kind of data is usually inappropriate. In order to make statistical comparisons on ordinal data, nonparametric tests, which assess the relative ranks rather than the magnitude of the data, are applied. Most parametric tests have a corresponding nonparametric test.

Nonparametric tests are also valuable for analyzing samples that deviate strongly from the normal distribution. Although parametric tests are based on the assumption that the distribution is normal, they tolerate moderate departures from normality (statisticians use the term robust) and can detect differences even when samples are not distributed normally. However, as the samples stray significantly from a normal distribution, parametric tests lose their ability to detect differences. Because nonparametric tests analyze only the ranks rather than the individual values, they may detect differences not detected by parametric tests.

The nonparametric test corresponding to the two-sample t-test is the Mann-Whitney U-test. With this test, the values in the two groups are combined and assigned ranks. The smallest (or largest) value is assigned the rank of 1; the next smallest (or next largest), the rank of 2. This process continues until the largest (or smallest) value has been assigned the rank equal to the sum of the two sample sizes. If values are tied in rank, they are assigned a value equal to the average of the corresponding ranks. For example, if two values are tied for ranks 4 and 5, both are assigned the rank of 4.5. The statistics R1 and R2 are equal to the sum of the ranks for groups 1 and 2, respectively. The test statistic, U, is determined by the following equation:

$$U = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1$$

where n1 and n2 are the sizes of the first and second groups, respectively. The value for U is then compared with critical values for U obtained from a table. The data comparing the blood pressures of normovolemic and hypovolemic subjects (see Table 21–9) can be analyzed using the Mann-Whitney U-test, as shown in Table 21–12. As with the two-sample t-test, the results of the Mann-Whitney U-test suggest a difference between the two groups (P < .05). Nonparametric tests such as the Mann-Whitney U-test rarely appeared in the anesthesia literature until recent years; however, with increased sophistication in research, these tests are being used more frequently. The nonparametric version of the paired t-test is the Wilcoxon paired-sample test.
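The ranking procedure, including the averaging of tied ranks, can be sketched as follows. Two assumptions are made: U is computed in the commonly used form U = n1·n2 + n1(n1 + 1)/2 – R1 (with the counterpart for group 2 obtained by exchanging the subscripts), and the blood pressure values are invented rather than taken from Table 21–9. The function names are hypothetical.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1) upward; ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def mann_whitney_u(group1, group2):
    """Return (R1, R2, U1, U2) from the pooled ranks of the two groups."""
    n1, n2 = len(group1), len(group2)
    ranks = average_ranks(list(group1) + list(group2))
    r1, r2 = sum(ranks[:n1]), sum(ranks[n1:])
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1  # ASSUMED standard form
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return r1, r2, u1, u2

# Tie handling: the two 5s occupy ranks 4 and 5, so each receives 4.5.
print(average_ranks([1, 2, 3, 5, 5, 7]))

# INVENTED mean blood pressures (mm Hg); not the values of Table 21-9.
normovolemic = [88, 92, 95, 90, 85, 97, 93, 91]
hypovolemic = [70, 75, 68, 80, 72, 66, 78, 74]
r1, r2, u1, u2 = mann_whitney_u(normovolemic, hypovolemic)
print(f"R1 = {r1}, R2 = {r2}, U1 = {u1}, U2 = {u2}")
```

The two U values always sum to n1·n2, which gives a quick arithmetic check; which of the two is compared with the tabulated critical value depends on how the table in use is constructed.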

TABLE 21–12. Nonparametric Comparison of Hypothetic Data from Table 21–9 Using the Mann-Whitney U-Test

Comparing Three or More Groups

The critical values for t are calculated with the assumption that comparisons are being made between only two groups. If the investigator collected data on three groups of subjects, three comparisons would be possible: A versus B, A versus C, and B versus C. If each of these comparisons was made using the two-sample t-test and α = 0.05, we would be accepting a 5 percent risk of committing a type I error for each comparison. For three comparisons, the chance of committing a type I error increases to approximately 3 × 5 percent, or 15 percent (actually closer to 14 percent*), a level that is usually considered unacceptable. As the number of groups increases, the number of possible comparisons increases such that, with enough groups, the investigator will eventually uncover a nonexistent difference.

Thus, the t-test is properly used for comparing two groups (or a single group to a predetermined value). When more than two groups are being compared, other tests, particularly analysis of variance, are more appropriate. If the investigator chooses to use the t-test to compare more than two groups, a correction must be made to limit the risk of type I errors. When one such correction, the Bonferroni inequality (or Bonferroni correction), is applied, the α level for each comparison is divided by the number of comparisons to be performed. For example, if the investigator chooses a value of 0.05 for α and three comparisons are possible, a value of 0.05/3 or 0.0167 for α should be used for each of the comparisons. Then, the investigator is able to state that, overall, the chance of committing a type I error is less than 5 percent.
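The arithmetic behind the inflated type I error and the Bonferroni correction can be checked with a short sketch. It assumes the comparisons are independent, which is the assumption underlying the 1 – (1 – α)^n expression; the function names are illustrative.

```python
def pairwise_comparisons(groups):
    """Number of distinct two-group comparisons among the given groups."""
    return groups * (groups - 1) // 2

def familywise_error(alpha, n):
    """Chance of at least one type I error in n independent comparisons."""
    return 1 - (1 - alpha) ** n

def bonferroni_alpha(alpha, n):
    """Per-comparison alpha under the Bonferroni correction."""
    return alpha / n

alpha = 0.05
n = pairwise_comparisons(3)  # A vs B, A vs C, B vs C

print("comparisons among 3 groups:", n)
print("comparisons among 4 groups:", pairwise_comparisons(4))
print(f"uncorrected overall risk: {familywise_error(alpha, n):.4f}")
corrected = bonferroni_alpha(alpha, n)
print(f"Bonferroni alpha per comparison: {corrected:.4f}")
print(f"overall risk after correction: {familywise_error(corrected, n):.4f}")
```

The uncorrected risk comes to roughly 14 percent for three comparisons, and after the correction the overall risk stays just below 5 percent.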

If the Bonferroni inequality is used, the investigator must decide in advance which of the comparisons will be made. For example, if there are four groups, the investigator may choose to compare group I with each of the other groups (e.g., if subjects in group I were given the placebo and subjects in groups II to IV were given one of three different drugs). In this case, only three of the six possible comparisons will be made. However, it is inappropriate to examine the data and then decide a posteriori which comparisons to make.

* For each comparison in which the investigator permits a type I error of α, he or she is (1 – α) confident that no error exists. For n comparisons, he or she is $(1 - \alpha)^n$ confident that no error exists; conversely, the probability of a type I error is $1 - (1 - \alpha)^n$.