Lab 1.6 What was the impact of my drug on average?
Introduction
In the last lab, we examined the nuts and bolts of a typical statistical test. In this case, as in most cases, the procedure involved calculating the likelihood that 2 samples of data were drawn from the same underlying distribution. That is, we posited that the "null hypothesis" was true: that our drug did nothing. We did this empirically first, and then saw that there was an equation, developed by the mathematician Kolmogorov, that agreed very closely with our empirical experiments on the computer. Finally, we calculated the likelihood that 2 experimental data sets were actually 2 samples of the same underlying distribution; this was very unlikely, so we concluded that the null hypothesis was false (that is, the 2 samples really were samples of different underlying true distributions).
Many statistical tests can be expressed in the form of a "recipe" that 1) describes the situations where the test can be used, and 2) describes the behavior of the test statistic when the null hypothesis is true (that is, when the underlying data are samples of the same distribution). Let's look at a sample recipe for the Kolmogorov-Smirnov test:
Kolmogorov-Smirnov test recipe
Question answered: Are 2 random samples S1 and S2 derived from the same underlying true distribution?
Statistic calculated: The ks_statistic is the maximum difference between the empirical cumulative histograms for S1 and S2.
How the statistic is distributed when the null hypothesis is true (that is, how it is distributed when the 2 samples are derived from the same underlying distribution): As dasw.stats.ks2_cdf(N1, N2, ks_statistic), where N1 and N2 are the number of samples of S1 and S2, respectively
How to perform this test in Matlab: If S1 and S2 are variables with N1 and N2 random samples, then one can perform the K-S test with [h,pvalue] = kstest2(S1,S2); If pvalue is less than the critical value alpha (such as 0.05), then we can reject the null hypothesis and say with 1-alpha confidence (such as 95% confidence) that S1 and S2 are samples of different underlying distributions.
Now let's follow this recipe on the data set we used last time:
sample1 = dasw.stats.generate_random_data(20,'normal',0,1);
sample2 = dasw.stats.generate_random_data(20,'normal',2,1);
[X1,Y1] = dasw.plot.cumhist(sample1);
[X2,Y2] = dasw.plot.cumhist(sample2);
figure;
plot(X1,Y1,'r-');
hold on
plot(X2,Y2,'b-');
axis([-10 10 0 100]);
xlabel('values of 2 samples of size 20');
ylabel('Percent of data');
Now that we understand the idea behind the statistical test, we can calculate it with the built-in function:
[h,pvalue] = kstest2(sample1,sample2)
Q1: What is pvalue? Can you reject the null hypothesis? That is, are these samples likely to have been derived from different underlying distributions?
In today's lab, we will look at how we can determine the mean of a true distribution with some confidence, and how to examine whether the means of 2 distributions are statistically different.
What is the mean value of my distribution?
Often in science, one is very interested in knowing what happened in an experiment "on average"; that is, we are interested in knowing the "true" mean of the "true" distribution of interest, or in knowing how the means of 2 distributions compare to one another.
Of course, if we have a set of samples, then calculating the mean of those samples (called the sample mean) is quite easy (this is the "mean" or "average" that you might have learned in high school). It is simply:
Sm = (S1 + S2 + S3 + ... + SN)/N, where N is the number of points in your sample.
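For example, here is a tiny made-up sample, just to illustrate; the built-in function mean does this calculation for you:
S = [3 5 8 10]; % a made-up sample with N = 4 points
Sm_by_hand = (3 + 5 + 8 + 10)/4 % the sample mean, computed by hand
Sm_builtin = mean(S) % the built-in function gives the same answer, 6.5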
We are typically interested in determining, with some confidence, the "true" mean of the distribution. Fortunately, we can learn a great deal about this "true" mean through sampling.
To build our intuition as to how sampling can tell us about the true mean, we'll do some virtual experiments on the computer. Suppose the distribution of data in the file squirrel_births.txt indicates, for 2010, the exact birth date of the first litter of young squirrels born to each of the 20,000 mature female squirrels in the city of Waltham, MA.
Note that here we are assuming that this is the "true distribution" of squirrel births. Of course, if we were scientists interested in understanding the mean birthday of squirrels in this year, we would be unable to secure funding to monitor the nests of 20,000 mature female squirrels, both because it would be really expensive and because it is unnecessary, thanks to sampling. Instead, suppose we set out to equip 30 randomly-chosen squirrel nests with video equipment in order to observe squirrel births and record the date of occurrence. We want to find the average birth day for this population.
How much will our 30 squirrel nests tell us about the population mean? To evaluate this, we will model our experiment in 2 steps:
We will randomly draw 30 points from the true distribution to comprise our sample.
We will calculate the sample mean Sm from this sampled data.
We'll repeat this modeled experiment 1000 times, to look at what the sample mean tends to look like for this kind of experiment. To do this, we want to grab 30 samples at random. One way to do this is to use the ready-made function randperm, which produces a randomly scrambled list of the numbers 1 to M; further, we can use the built-in function length to find out the number of points M in a variable. Let's try this with a small example. Suppose we want to grab 3 data points at random from a list of 6 data points:
S = [ 0.5 7 10 4 3 5 ]; % a data set with 6 elements
number_of_elements = length(S) % length(S) returns the number of elements of S
R = randperm(number_of_elements) % returns a random permutation
R(1:3), % the first 3 elements of the random permutation
newsample = S(R(1:3)) % the new sample of 3 points from the data set
Okay, now let's model our experiment. We'll put this in an m file so we can repeat it with some other distributions easily. We'll also write a function autohistogram.m to make automatic guesses as to a nice number of bins to use for plotting the histogram. Put simulate_random_sampling.m in your +dasw/+stats package and autohistogram.m in your +dasw/+plot package.
simulate_random_sampling.m
function [samplemeans, true_mean] = simulate_random_sampling(true_d, N, M)
% SIMULATE_RANDOM_SAMPLING - Simulate a random sampling experiment
% [SAMPLEMEANS, TRUE_MEAN] = dasw.stats.simulate_random_sampling(TRUE_D, N, M)
%
% Inputs: TRUE_D: the "true distribution"
% N : the number of samples in each experiment
% M : the number of experiments to simulate
% Outputs:
% SAMPLEMEANS : The distribution of M sample means (1 for each experiment)
% TRUE_MEAN : The true mean; that is, the mean of the true distribution
%
% The distribution of the M sample means is also plotted in a new figure.
true_mean = mean(true_d); % we know the answer!
i = 1;
samplemeans = []; % begin with the sample means empty
while (i<=M),
randompermutation = randperm(length(true_d));
sample = true_d(randompermutation(1:N));
samplemeans(i) = mean(sample);
i = i+1;
end;
figure;
[counts,bin_centers] = dasw.plot.autohistogram(samplemeans); % bin the sample means (avoid overwriting the input N)
bar(bin_centers,100*counts/sum(counts)); % plot as percent of experiments
xlabel('Sample mean');
ylabel('Percent of data');
hold on;
[X,Y] = dasw.plot.cumhist(samplemeans);
plot(X,Y,'r-'); % red line for cumulative histogram
% plot the true mean as a vertical bar
plot([true_mean true_mean], [0 100],'k--','linewidth',2);
axis([bin_centers(1) bin_centers(end) 0 100]);
autohistogram.m
function [counts,bin_centers] = autohistogram(data)
%AUTOHISTOGRAM - Choose bins based on Freedman–Diaconis' choice
% [COUNTS,BIN_CENTERS] = dasw.plot.autohistogram(DATA)
% Automatically chooses bin sizes based on Freedman-Diaconis' choice,
% defined to be
% WIDTH = 2*IQR(DATA)/CUBE ROOT OF NUMBER OF DATAPOINTS
% (see Histogram on Wikipedia)
% Inputs: DATA, a set of samples
% Outputs: COUNTS, the number of DATA samples in each bin
% BIN_CENTERS - the center location of each bin
%
bin_min = min(data);
bin_max = max(data);
num_datapoints = length(data);
bin_width = 2*iqr(data)/power(num_datapoints,1/3); % use Freedman-Diaconis
bin_edges = (bin_min-bin_width):bin_width:(bin_max+bin_width);
bin_centers = (bin_edges(1:end-1) + bin_edges(2:end))/2;
counts = histc(data,bin_edges);
counts = counts(1:end-1); % remove the last bin that is returned by histc
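Before sampling, it can help to look at the "true distribution" itself. Here is a quick sketch that uses autohistogram on the full squirrel data set (the same file is loaded again in the next step; the axis labels here are just reasonable guesses about the units):
squirrel_births = load('squirrel_births.txt','-ascii'); % all 20,000 birth days
[counts,bin_centers] = dasw.plot.autohistogram(squirrel_births);
figure;
bar(bin_centers,100*counts/sum(counts));
xlabel('Birth day of first litter');
ylabel('Percent of squirrels');
title('The true distribution of squirrel birth days');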
Now let's examine the distribution of sample means that we might obtain in the experiment that we designed:
squirrel_births = load('squirrel_births.txt','-ascii');
dasw.stats.simulate_random_sampling(squirrel_births, 30, 1000);
title('The sample mean of squirrel birth days for 1000 sampling experiments');
Q2: Is the distribution of sample means that we would obtain for experiments of this type broad or narrow? Is the distribution of sample means clustered around the true mean, or is it all over the place? Based on the cumulative histogram, what is the value we would have obtained for the sample mean at the 2.5th percentile of all experiments? What is the value we would have obtained at the 97.5th percentile? The low value (the X value at the 2.5th percentile) and the high value (the X value at the 97.5th percentile) together comprise the 95% confidence intervals for the true mean; we can be 95% sure that the true mean lies within the confidence intervals.
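If you'd like to read these percentiles off numerically rather than by eye, here is one possible sketch (it assumes you capture the samplemeans output, and it uses prctile from the Statistics Toolbox):
[samplemeans,true_mean] = dasw.stats.simulate_random_sampling(squirrel_births, 30, 1000);
CI_low = prctile(samplemeans, 2.5) % the X value at the 2.5th percentile
CI_high = prctile(samplemeans, 97.5) % the X value at the 97.5th percentile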
Dependence of the confidence interval of the mean on the number of samples
Now let's examine the dependence of our confidence in the mean based on the number of samples in our experiment. Let's try values of 10 samples, 30 samples, and 60 samples:
dasw.stats.simulate_random_sampling(squirrel_births, 10, 1000);
title('Sample means of squirrel birth days for 1000 experiments, 10 samples');
dasw.stats.simulate_random_sampling(squirrel_births, 30, 1000);
title('Sample means of squirrel birth days for 1000 experiments, 30 samples');
dasw.stats.simulate_random_sampling(squirrel_births, 60, 1000);
title('Sample means of squirrel birth days for 1000 experiments, 60 samples');
Q3: What are the 95% confidence intervals for 10 samples and 60 samples? Does having more samples increase our certainty regarding the mean?
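A compact way to compare all 3 cases at once is sketched below (it reuses prctile as above, and each call will open its own figure):
for N_samples = [10 30 60],
    samplemeans = dasw.stats.simulate_random_sampling(squirrel_births, N_samples, 1000);
    ci = prctile(samplemeans, [2.5 97.5]); % the 95% confidence intervals
    disp(['N = ' int2str(N_samples) ': 95% confidence interval runs from ' ...
        num2str(ci(1)) ' to ' num2str(ci(2))]);
end;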
Dependence of the shape of the distribution of sample means on the type of true distribution
The distribution of sample means is tightly clustered around the true mean for the data above; however, is this true for all types of data? Consider the true distribution of human birthdays in Massachusetts in 2010 (or what it might be; this data is made up), or the number of heads that occur in 1000 coin flips for 10,000 different experiments. These true distributions have very different shapes from the true distribution of squirrel births: the human birthdays are very close to a uniform distribution, with all values equally likely, while the coin-flip counts form a normal distribution, the "bell curve" distribution that we'll introduce formally in short order.
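These 2 data sets are provided as files below (human_births.txt and heads_coinflips.txt). If you want a feel for where such shapes could come from, here is a rough synthetic stand-in (a sketch only; the real files in this lab were not generated this way, and the exact numbers are arbitrary):
fake_human_births = randi(365, 20000, 1); % roughly uniform: every day of the year about equally likely
fake_heads_counts = sum(rand(1000, 10000) < 0.5, 1); % heads in 1000 fair coin flips, repeated 10,000 times
% Either of these could be passed to dasw.stats.simulate_random_sampling just like the real data files.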
Now let's imagine we performed sampling experiments to estimate the mean values of these distributions.
human_births = load('human_births.txt','-ascii');
dasw.stats.simulate_random_sampling(human_births, 30, 1000);
title('Sample means of birth days for 1000 sampling experiments, 30 samples');
heads_coinflips = load('heads_coinflips.txt','-ascii');
dasw.stats.simulate_random_sampling(heads_coinflips, 30, 1000);
title('Sample means of number of heads for 1000 sampling experiments, 30 samples');
Q4: Are the sample means of the human births as tightly clustered as the sample means for the squirrel births? Do the sample means exhibit any clustering? If the distribution of sample means for the human births is wider, does it have the same basic shape as that of squirrel births or numbers of heads?
The normal distribution and the central limit theorem
There is a special type of distribution that arises frequently in both mathematics and in the natural sciences, and this is the normal distribution, or the bell curve distribution. The distribution of sample means in our 1000 sampling experiments above is approximately a normal distribution. This is the result of a beautiful theorem in mathematics called the central limit theorem. For our purposes, the central limit theorem can be summarized as follows:
If random samples of a fixed N are drawn from any population (regardless of the form of the population distribution), as N becomes larger, the distribution of sample means approaches normality, with the overall mean Sm approaching the true mean µ, and a standard error sigma_Sm that is equal to the true standard deviation divided by sqrt(N). (Paraphrased from Runyun)
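As a quick empirical check of the last clause (a sketch; it simply re-runs the squirrel simulation and keeps its outputs), compare the spread of the 1000 sample means to the true standard deviation divided by sqrt(N):
[samplemeans,true_mean] = dasw.stats.simulate_random_sampling(squirrel_births, 30, 1000);
empirical_spread = std(samplemeans) % the standard deviation of the 1000 sample means
predicted_spread = std(squirrel_births)/sqrt(30) % the standard error predicted by the theorem
These 2 numbers should come out quite close to one another.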
The equation for the normal distribution, where the mean of the distribution is µ, the standard deviation is sigma, and ∆x is the resolution of the X axis, is as follows:
P(x) = ∆x * exp( -(x - µ)^2 / (2*sigma^2) ) / sqrt(2*pi*sigma^2)
The central limit theorem is good news, because it allows us to estimate our confidence in the mean of a distribution from a single sampling experiment (instead of the 1000 in our example above). By letting µ be our sample mean Sm, and letting sigma be the standard error of the mean (calculated as the standard deviation divided by sqrt(N), where N is the number of points in our sample), we can obtain confidence intervals around our estimate of the mean. Let's try it.
First, let's remind ourselves of our results from the 1000 sampling experiments:
dasw.stats.simulate_random_sampling(squirrel_births, 30, 1000);
title('Sample means of squirrel birth days for 1000 experiments, 30 samples');
Now let's conduct a single sampling experiment:
R = randperm(length(squirrel_births));
S_30 = squirrel_births(R(1:30));
Sm = mean(S_30);
S_standarddeviation = std(S_30); % std calculates the standard deviation
Std_error = S_standarddeviation/sqrt(30);
Now, based on the central limit theorem, we predict that, if we had done many many experiments (like 1000), the distribution of sample means should look something like this:
figure;
X = 1:200; % the range of possible birth days over which to evaluate the prediction
dX = 1; % the resolution of the X axis
Nus = dX*exp(-(power(X-Sm,2)/(2*power(Std_error,2))))/sqrt(2*pi*power(Std_error,2)); % the predicted normal distribution of sample means
cumulative = cumsum(Nus); % the predicted cumulative distribution (runs from 0 to 1)
hold on
plot(X,Nus,'g-'); % the predicted distribution of sample means
plot(X,cumulative,'g--'); % the predicted cumulative distribution
Q5: Based on the distribution of means predicted by the central limit theorem, what are the 95% confidence intervals of the mean? (Use the 2.5% and 97.5% limits of the cumulative distribution of the sample means that we just created.)
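One way to pull these limits out of the predicted curve numerically is sketched here (it uses the X and cumulative variables defined just above, where cumulative runs from 0 to 1, and 1.96 is the standard normal value at the 97.5th percentile):
CI_low = X(find(cumulative >= 0.025, 1)) % first X value where the cumulative reaches 2.5%
CI_high = X(find(cumulative >= 0.975, 1)) % first X value where the cumulative reaches 97.5%
% For a normal distribution these should be close to Sm +/- 1.96 standard errors:
[Sm-1.96*Std_error Sm+1.96*Std_error]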
The difference between the means of 2 samples of the same distribution is distributed as the T distribution
Finally, we are interested in knowing when 2 distributions have different means. As usual, we rely on the fact that mathematicians have calculated how the difference between 2 sample means should be distributed when the mean of a distribution is sampled twice. We won't go over this formula in class, but it is in your textbook. Instead, we will merely cover how to implement the test and give the recipe.
Just as with the Kolmogorov-Smirnov test, we can calculate whether this difference is greater than expected by comparing the actual difference to the predicted distribution of differences. This test, called the t-test, is built right into Matlab's Statistics Toolbox (the ttest2 function; see help ttest2).
We will generate 2 samples of the same underlying distribution by again sampling 30 points from the squirrel_births true distribution, and perform a t-test to see whether the 2 sample means differ more than would be expected by chance.
R2 = randperm(length(squirrel_births));
S2_30 = squirrel_births(R2(1:30));
[h,pvalue] = ttest2(S_30,S2_30);
pvalue,
The pvalue indicates the likelihood of observing a difference in sample means at least this large if the 2 underlying means were actually the same. In this example, the pvalue is likely to be rather high (greater than 0.05, for example). If we were to perform this test on data for which the underlying means were different, then the pvalue would typically be low. If we choose a significance level alpha of 0.05, then we say that the means are significantly different if the pvalue is less than 0.05.
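To see the other case in action, here is a quick sketch (the 30-day shift is an arbitrary illustration, not part of the lab data):
S2_shifted = S2_30 + 30; % pretend this second group of births happened a month later
[h,pvalue_shifted] = ttest2(S_30, S2_shifted);
pvalue_shifted, % this should now be small (typically well below 0.05)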
Note: While our confidence in the mean that we calculated above using the central limit theorem is general for any type of data, calculating the significance of the difference of means using the T distribution assumes that the underlying true distributions are "normal" distributions. In practice, however, this procedure works well for data that are approximately normally distributed (that is, where most of the data points lie in the middle of the distribution, and there are no hard edges or thresholds).
T-test of 2 sample means recipe
Question answered: Do 2 samples S1 and S2 have the same underlying mean? Technically, we assume that S1 and S2 are samples of quantities that are normally distributed; however, the test has been shown to still work for most data that have some central tendency and lack hard thresholds or edges.
Statistic calculated: The T statistic (formula in Baldi).
How the statistic is distributed when the null hypothesis is true (that is, how it is distributed when the 2 samples have identical means): As T2_CDF(N1, N2, T_statistic), where N1 and N2 are the number of samples of S1 and S2, respectively
How to perform this test in Matlab: If S1 and S2 are variables with N1 and N2 random samples, then one can perform the T test with [h,pvalue] = ttest2(S1,S2); If pvalue is less than the critical value alpha (such as 0.05), then we can reject the null hypothesis and say with 1-alpha confidence (such as 95% confidence) that S1 and S2 have different means.
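As a final sketch, here is the recipe applied to the drug-experiment samples generated at the start of this lab (sample1 and sample2, which were drawn from normal distributions with means 0 and 2):
[h,pvalue] = ttest2(sample1,sample2)
% pvalue should be well below 0.05, so we reject the null hypothesis and conclude,
% with 95% confidence, that sample1 and sample2 have different underlying means.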