Lab 1.8 How well did my drug do? Variance explained and discriminability
Introduction
In the previous few labs, we learned that populations exhibit variability, and we also learned how to calculate whether or not the means of different populations are significantly different from one another. In this lab, we will examine how large the effect of a drug might be, in terms of how much of the experimental variance can be explained by the factors in the experiment. These concepts will allow us to perform statistical tests on whether the means of 3 or more groups are different, and will also provide a basis for fitting data to models.
Variance within groups and between/across groups
Let's start with a simple quantity like height. Consider this dataset of the heights of boys and girls:
heights = load('children_heights.txt','-ascii');
This file has 2 columns. The first column indicates each individual's height, and the second column indicates that child's sex (1 for boy, 0 for girl). To get a sense of the structure of this matrix, let's look at, say, the first 10 rows:
heights(1:10,:)
This matrix is fine, but it would be a little easier to look at this data in a table.
heights_t = array2table(heights,'VariableNames',["Height","Sex"]);
heights_t.("Sex") = categorical(heights_t.("Sex"),[0 1],["Girl" "Boy"]); % convert the 0,1 values to strings
Now let's take a look at this table:
heights_t
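For a quick overview of the table's contents, MATLAB's summary function reports the range of the numeric Height variable and the counts of each Sex category:
summary(heights_t)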
We'd like to identify the boys and girls in this sample to plot them separately. We can do this using the very useful find command. Let's demonstrate the find command's power; we'll use it over and over for various data analysis tasks as the course progresses.
Suppose we have an array of values:
A = [ 1.5 2.5 3.5 4.5 5.5 6.5 ]
We can use the find command to give us the element index values matching particular criteria:
inds = find(A>3)
This returns [3 4 5 6] because the 3rd, 4th, 5th, and 6th element of A are greater than 3. We can break this down a little bit more. Note that
B = A>3
performs an element-by-element comparison for all the values in A. Then, the find command finds all of the entries that are not 0:
inds = find(B)
If we then examine the variable A at the index values indicated by the find command, we will see all of the points greater than 3:
A(inds)
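As an aside, MATLAB also lets you skip find entirely and index with the logical array itself; the following returns the same values as A(inds):
A(A>3) % logical indexing: keep the elements where the comparison is true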
Now we can return to our example at hand. To find all of the index values in our dataset that correspond to boys, we can use:
boys = find(heights_t.("Sex")=="Boy");
% want to unpack this more? type in: heights_t.("Sex")=="Boy"
girls = find(heights_t.("Sex")=="Girl");
Now we can identify the heights of all the girls by accessing only those elements of heights:
heights_t(girls,"Height")
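A note on table indexing: using parentheses, as above, returns another (smaller) table. To get the raw numbers themselves, you can use dot indexing, which is equivalent to the table2array calls we use below:
heights_t.Height(girls) % a plain numeric vector of the girls' heights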
Let's calculate the means and plot the data:
mn_total = mean(table2array(heights_t(:,"Height"))); % compute the mean of whole data set
mn_girls = mean(table2array(heights_t(girls,"Height"))); % compute mean of girls
mn_boys = mean(table2array(heights_t(boys,"Height"))); % compute mean of boys
bar([1 2 3],[mn_total mn_girls mn_boys],'w');
hold on
plot(1,table2array(heights_t(:,"Height")),'go'); % plot all heights
plot(2,table2array(heights_t(girls,"Height")),'rd'); % plot girl heights
plot(3,table2array(heights_t(boys,"Height")),'bs'); % plot boy heights
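Optionally, you can label the bars and the Y axis to make the plot easier to read; these are standard graphics commands:
set(gca,'XTick',[1 2 3],'XTickLabel',{'All','Girls','Boys'});
ylabel('Height (inches)');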
Q1: Are the differences in the means significant? We can check that out:
[h,p_t] = ttest2(table2array(heights_t(girls,"Height")),table2array(heights_t(boys,"Height")))
Now, how much of the variation in this data can be explained by sex?
The total variation that we observed, that is, the total sum of squared deviations from the mean, is simply the variance per point multiplied by the number of samples:
variance_per_point=var(table2array(heights_t(:,"Height")),1); % var(X,1) normalizes by N (see Lab 1.7)
number_points = length(table2array(heights_t(:,"Height")));
total_variance = number_points * variance_per_point
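To see why this multiplication works, recall that var(X,1) is the mean squared deviation from the mean, so multiplying by the number of points recovers the sum of squared deviations. A quick check:
x = table2array(heights_t(:,"Height"));
sum((x - mean(x)).^2) % should equal total_variance, up to rounding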
This total variance is the sum of the variance that occurs 1) within the 2 groups, and 2) between (or across or among) the 2 groups.
We can calculate the variance within each group (that is, for each sex) as follows:
girl_variance = length(girls) * var(table2array(heights_t(girls,"Height")),1)
boy_variance = length(boys) * var(table2array(heights_t(boys,"Height")),1)
So the total variance "within" groups is:
within_variance = boy_variance + girl_variance
Finally, we can calculate the variance that exists between the group means. This is calculated as a weighted sum of the squared differences between each group's mean and the "grand mean" of all data points, where each group is weighted by its number of points:
between_variance=length(boys)*(mn_boys-mn_total)^2+length(girls)*(mn_girls-mn_total)^2
The total variance is the sum of the variation within groups plus the variation between/across groups:
total_variance - (within_variance + between_variance) %should==0, up to rounding
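In symbols, writing \bar{x} for the grand mean, \bar{x}_g for the mean of group g, and n_g for the number of points in group g, the identity we just verified is the standard sum-of-squares partition (total = within + between):

\sum_{g}\sum_{i=1}^{n_g}(x_{gi}-\bar{x})^2 \;=\; \sum_{g}\sum_{i=1}^{n_g}(x_{gi}-\bar{x}_g)^2 \;+\; \sum_{g} n_g\,(\bar{x}_g-\bar{x})^2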
So we can calculate the percentage of variance that is "explained" by group membership, that is, how much variance is explained by the variance between groups:
explained_variance = 100 * between_variance/total_variance
unexplained_variance = 100 * within_variance/total_variance
Q2: How much of the variation in this population is explained by sex? If you know the sex of an individual, do you have a lot of information about his/her height?
Q3: Is the unexplained variance a result of variation in the population, measurement noise, or both?
Analysis of variance (ANOVA)
In a neat trick, we can use the between-group variance and the within-group variance to independently estimate the variance of the "true" distribution. Each sum of squares is divided by its degrees of freedom to form a so-called mean square:
MS_between = between_variance/(k-1) and MS_within = within_variance/(N-k)
where k is the number of groups and N is the total number of data points.
If there are no significant differences across the means of the groups, then these two estimates should be approximately equal. The mathematician R.A. Fisher developed a test, called the F test in his honor, based on their ratio F = MS_between/MS_within; the test provides the probability that a ratio at least this large would arise by chance if all the groups had the same mean.
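Here is a minimal sketch computing F and its p value for the heights data from the quantities calculated above; it assumes the Statistics Toolbox function fcdf (the F cumulative distribution function) is available:
k = 2; % number of groups (girls, boys)
N = number_points; % total number of data points
MS_between = between_variance/(k-1); % between-group variance estimate
MS_within = within_variance/(N-k); % within-group variance estimate
F = MS_between/MS_within % the F statistic
p_F = 1 - fcdf(F, k-1, N-k) % probability of an F this large by chance
For 2 groups the F test is equivalent to the t test (F is the square of the t statistic), so p_F should match p_t from above.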
Just like the other statistical tests we have seen in the class, one can calculate the value of F and compare it to a reference distribution with the same number of groups and individual data points, to examine the probability that any difference is just due to sampling. This technique is called the Analysis of Variance, abbreviated ANOVA. Matlab can perform an ANOVA using the function anova1 (see help anova1).
Let's try Matlab's function:
[p_a,table]=anova1(table2array(heights_t(:,"Height")),table2array(heights_t(:,"Sex")))
Notice that the table that is generated includes the total, within-group, and between-group variation in the column 'SS' (which stands for "sum of squares"). So, in the future, you can use Matlab's function for this purpose rather than writing out the formulas.
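If you want to grab these values programmatically, note that the second output of anova1 is a cell array whose rows are (header; between-groups; within-groups; total), an ordering you can verify against your own output:
SS_between_anova = table{2,2} % 'Groups' row, 'SS' column
SS_within_anova = table{3,2} % 'Error' row
SS_total_anova = table{4,2} % 'Total' row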
Q4: Does the p value p_a indicate that the differences in means between the groups are significant? How does the value of p_a compare to the value returned from the t test p_t?
ANOVA with 3 or more groups
Okay, so why do we need a second test that apparently does the same thing as the t test? The reason is that the ANOVA can do even more than the t test. Suppose we want to evaluate whether the means of 3 or more groups are significantly different. We can't use the t test for this purpose. Why?
Suppose we sample the same distribution twice. If we perform a t test, we expect to see a p value of less than 0.05 about 5% of the time. We can simulate this as follows.
outcomes = [];
i = 1;
while i<=10000, % run exactly 10000 simulations
A = randn(5,1); % 5 normally-distributed points, mean 0 std dev 1
B = randn(5,1); % 5 points from same distribution
[h,outcomes(i)] = ttest2(A,B);
i = i + 1;
end;
num_sig = length(find(outcomes<0.05));
100*num_sig/10000 % should be about 5%
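As an aside, when the null hypothesis is true, the p values themselves are uniformly distributed between 0 and 1; you can see this by plotting their histogram:
figure;
hist(outcomes,20); % roughly flat: each bin holds about 5% of the simulations
xlabel('p value'); ylabel('counts');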
Now suppose we have 4 groups (maybe we tested 4 different treatments), and we want to know if the mean of any of these 4 groups is different from the mean of any other. One naive thing to do might be to perform all pairwise combinations of t tests. Let's do this to show that it is the wrong thing to do. We will look for any evidence of a significant t test among the groups by using the logical OR operator, which is designated by a vertical bar (|):
p1 = 0.04;
p2 = 0.5;
outcome = p1<0.05 | p2<0.05 % returns 1 for true since p1<0.05 OR p2<0.05
outcome = p1<0.01 | p2<0.01 % returns 0 for false since neither expression is true
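When there are many p values, chaining | comparisons gets tedious; the built-in function any returns 1 (true) if any element of an array is nonzero, so an equivalent form is:
outcome = any([p1 p2] < 0.05) % same as p1<0.05 | p2<0.05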
Now let's try our simulation:
outcomes_dumb = [];
outcomes_anova = [];
G = [1;1;1;1;1;2;2;2;2;2;3;3;3;3;3;4;4;4;4;4]; % group labels for our data
i=1;
while i<=10000,
A = randn(5,1); % 5 normally-distributed points, mean 0 std dev 1
B = randn(5,1); % 5 points from same distribution
C = randn(5,1); % 5 points from same distribution
D = randn(5,1); % 5 points from same distribution
[h,p1] = ttest2(A,B);
[h,p2] = ttest2(A,C);
[h,p3] = ttest2(A,D);
[h,p4] = ttest2(B,C);
[h,p5] = ttest2(B,D);
[h,p6] = ttest2(C,D);
outcomes_dumb(i) = p1<0.05|p2<0.05|p3<0.05|p4<0.05|p5<0.05|p6<0.05;
outcomes_anova(i) = anova1([A;B;C;D],G,'off'); % 'off' suppresses the figure and table
i = i + 1;
end;
num_sig_dumb = length(find(outcomes_dumb>0));
100*num_sig_dumb/10000 % is this close to 5%? (see Q5)
num_sig_anova = length(find(outcomes_anova<0.05));
100*num_sig_anova/10000 % should be about 5%
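For some intuition about the naive method: if the 6 t tests were independent (they are not quite, since they share data), the chance that at least one of them would fall below 0.05 by luck alone would be:
100 * (1 - (1-0.05)^6) % about 26%, far more than 5%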
Q5: How many simulations were "significant" for the naive method of successive t-tests with an alpha of 5%? Was this percentage close to 5% or was it a lot greater? Why do you think this is? What fraction of simulations were "significant" for the ANOVA test? Said another way, how many chances did the data have to pass the 5% t-tests? How many chances should the data "deserve" to pass a 5% statistical test?
Assumptions of the ANOVA
Like the t test, the ANOVA assumes that all of the underlying variables are normally distributed. The ANOVA is probably still okay to use when the data are "normal-like"; that is, they have a strong central tendency and they don't hit any hard limits (like 0 or some upper bound of an index).
Discriminability
If you know a person's height, can you guess whether they are a boy or a girl? How often would you be right? How often would you be wrong?
Let's look at the cumulative histograms for boys and girls.
[X1,Y1] = dasw.plot.cumhist(heights(boys,1));
[X2,Y2] = dasw.plot.cumhist(heights(girls,1));
figure;
plot(X1,Y1,'b');
hold on;
plot(X2,Y2,'r');
axis([40 75 0 100]);
ylabel('Percent of data');
xlabel('Heights (inches)');
legend('Boys','Girls');
If you were given someone's height, you would probably look at the cumulative histogram and try to make a guess as to whether the subject were a boy or a girl. If the height was less than 60 inches, you'd probably just guess randomly, since there is no difference between boys and girls for those heights; but, if the height were greater than 60, you'd probably guess boy, and you'd be right slightly more often than you were wrong.
What is the likelihood that you guess correctly overall? Let's look at the height 70. Only about 5% of individuals are taller than 70 inches, so you wouldn't be asked about this very often; but if you were, you'd guess boy, and you'd be right more often than not. If you were asked about an individual 70 inches tall, from the cumulative histogram you can see that there is about a 6.8% chance that the person is a boy (and your guess was right), and a 0.3% chance that the person is a girl (and your guess was wrong). If we repeat this game many times for different heights, and put the wrong-guess and correct-guess percentages on the X and Y axes, respectively, we obtain what is called a discrimination curve. The function roc_analysis.m provided here computes this curve; we'll introduce the function and then study how it works. Put it in your +dasw/+stats package.
The function roc_analysis.m:
function [discrim,TPR,FPR,Xvalues,confusion,sample1CUM,sample2CUM] = roc_analysis(sample1,sample2)
% ROC_ANALYSIS - Receiver operating characteristics
% [DISCRIM,TPR,FPR,XVALUES,...
%      CONFUSION,SAMPLE1CUM,SAMPLE2CUM]=dasw.stats.roc_analysis(SAMPLE1,SAMPLE2)
%
% Performs ROC analysis to see how sensitivity and false positives trade
% off if the task is to say which distribution a given value is likely
% to have arisen from.
%
% (see http://en.wikipedia.org/wiki/Receiver_operating_characteristic)
%
% For each value of X present in the samples SAMPLE1 and SAMPLE2,
% ROC returns the resulting probability of a "true accept" (we say
% the value is from SAMPLE2 and it is) and the resulting
% probability of a "false accept" (we say the sample is from SAMPLE2
% but it's really from SAMPLE1).
%
% Inputs: SAMPLE1 - an array of sample data
% SAMPLE2 - an array of sample data
% Outputs: DISCRIM - Likelihood of discriminating the 2 distributions
% (accuracy) for each possible threshold X
% TPR - The true positive rate (sensitivity): the rate of a true accept, TP_x./(TP_x+FN_x)
% FPR - The false positive rate: the rate of a false accept, FP_x./(FP_x+TN_x)
% XVALUES - The X-axis values for the cumulative density functions
% CONFUSION - A structure with the "confusion matrix" assuming each value of
% X is used as a threshold. It has fields:
%              confusion.TP_x: (true positive)
%                likelihood sample is X or greater and comes from sample 2
%              confusion.FN_x: (false negative)
%                likelihood sample is less than X and comes from sample 2
%              confusion.TN_x: (true negative)
%                likelihood sample is less than X and comes from sample 1
%              confusion.FP_x: (false positive)
%                likelihood sample is X or greater and comes from sample 1
% SAMPLE1CUM - The cumulative sums of SAMPLE1
% SAMPLE2CUM - The cumulative sums of SAMPLE2
sample1 = sample1(:); % force column vectors
sample2 = sample2(:);
% use every observed value (plus a point just below the minimum) as a bin edge
bin_edges = unique([min([sample1;sample2])-1; sample1; sample2]);
bin_counts1 = histc(sample1, bin_edges); % counts of sample1 at each edge
bin_counts2 = histc(sample2, bin_edges); % counts of sample2 at each edge
Xvalues = bin_edges;
sample1CUM = cumsum(bin_counts1); % cumulative counts of sample1
sample2CUM = cumsum(bin_counts2); % cumulative counts of sample2
% entries of the confusion matrix for each possible threshold X,
% normalized by the total number of data points
confusion.TP_x = (sum(bin_counts2)-sample2CUM) ./ (sum(bin_counts1)+sum(bin_counts2));
confusion.FN_x = [0;sample2CUM(2:end)./(sum(bin_counts1)+sum(bin_counts2))];
TPR = confusion.TP_x ./ (confusion.TP_x+confusion.FN_x); % sensitivity
confusion.FP_x = (sum(bin_counts1)-sample1CUM) ./ (sum(bin_counts1)+sum(bin_counts2));
confusion.TN_x = [0;sample1CUM(2:end)./(sum(bin_counts1)+sum(bin_counts2))];
FPR = confusion.FP_x./(confusion.TN_x+confusion.FP_x); % false positive rate
% accuracy at each threshold: fraction of all points classified correctly
discrim = ...
    (confusion.TP_x+confusion.TN_x)./ ...
    (confusion.TP_x+confusion.FN_x+confusion.FP_x+confusion.TN_x);
Now let's apply the analysis to our heights data, with girls as sample 1 and boys as sample 2:
[discrim,pta,pfa,Xvalues] = dasw.stats.roc_analysis(heights(girls,1),heights(boys,1));
figure;
plot(pfa,pta,'b'); % p of wrong guesses vs. correct guesses
hold on
plot([0 1],[0 1],'k--'); % line y=x
xlabel('Prob of incorrectly calling a girl a boy');
ylabel('Prob of correctly declaring a boy a boy');
The area under this curve indicates the percentage of time that you'd guess correctly if you played the game many many times. We can compute the accuracy of using each value to try to classify an unknown sample into one of these 2 distributions for each threshold. The answer is in the variable discrim:
[Xvalues(:) discrim(:)]
In the left column, the function shows a possible dividing threshold (if the height is Xvalues(i) or greater, then we say the sample is most likely to have arisen from distribution 2, boys in this case). In the right column is the fraction of the time you'd be able to guess correctly, given the statistics of these 2 samples, if you used that threshold.
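To pull out the best threshold programmatically, you can ask max for both the maximum accuracy and the index at which it occurs:
[best_accuracy, ind] = max(discrim);
best_threshold = Xvalues(ind)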
Q6: What fraction of the time would you expect to be able to correctly identify the sex of a child given his/her height, assuming you use the best dividing threshold?
Let's perform the integration under a very small example curve ourselves to see how this is done. We'll calculate the width of each rectangle using the function diff, which takes the difference of sequential elements in a vector (see help diff). Suppose the values on the X axis and Y axis were
Xvals = [ 5 7 10]
Yvals = [ 3 5 6]
then we could estimate the rectangle widths and heights by
rect_widths = diff(Xvals)
rect_heights = (Yvals(1:end-1)+Yvals(2:end))/2, % average successive elements
Now we'll calculate the area under the curve by matrix multiplication
rect_heights * rect_widths',
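MATLAB's built-in function trapz performs this same trapezoidal integration, so trapz(Xvals,Yvals) reproduces the value above. We can apply it to our discrimination curve as well; the absolute value is needed because pfa decreases as the threshold increases:
trapz(Xvals, Yvals) % same area as rect_heights * rect_widths'
auc = abs(trapz(pfa, pta)) % area under the discrimination curve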
If you read the roc_analysis.m function, you'll notice that this is exactly what is done; read through it and see if you can understand its steps.
Function reference
Matlab functions and operators: load, array2table, table2array, categorical, summary, find, mean, var, length, bar, plot, ttest2, anova1, randn, while, | (logical OR), any, unique, histc, cumsum, diff, max, trapz
User functions
dasw.stats.roc_analysis.m
Copyright 2011-2012 Stephen D. Van Hooser, all rights reserved.