Lab 1.8 How well did my drug do? Variance explained and discriminability
In the previous few labs, we learned that populations exhibit variability, and we also learned how to calculate whether or not the means of different populations are significantly different from one another. In this lab, we will examine how large the effect of a drug might be, in terms of how much of the experimental variance can be explained by the factors in the experiment. These concepts will allow us to perform statistical tests on whether the means of 3 or more groups are different, and will also provide a basis for fitting data to models.
Variance within groups and between/across groups
Let's start with a simple quantity like height. Consider this data of the heights of boy and girl children:
This file has 2 columns. The first column indicates each individual's height, and the second column indicates that child's sex (1 for boy, 0 for girl). To get a sense of the structure of this matrix, let's look at, say, the first 10 rows:
We'd like to identify the boys and girls in this sample to plot them separately. We can do this using the very useful find command. Let's demonstrate the find command's power; we'll use it over and over for various data analysis tasks as the course progresses.
Suppose we have an array of values:
We can use the find command to give us the element index values matching particular criteria:
This returns [3 4 5 6] because the 3rd, 4th, 5th, and 6th elements of A are greater than 3. We can break this down a little bit more. Note that
performs an element-by-element comparison for all the values in A. Then, the find command finds all of the entries that are not 0:
If we then examine the variable A at the index values indicated by the find command, we will see all of the points greater than 3:
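The steps above can be sketched as follows. The array A here is hypothetical, chosen only so that its 3rd through 6th elements are greater than 3; it is not necessarily the array used in the lab.

```matlab
% Hypothetical array (an assumption for illustration):
% entries 3 through 6 are greater than 3
A = [1 2 4 5 6 7];

inds = find(A > 3)   % index values of the elements greater than 3
B = A > 3            % element-by-element comparison: 1 where true, 0 where false
inds2 = find(B)      % find returns the positions of the entries that are not 0
vals = A(inds)       % the values of A at those positions
```

Leaving off the semicolons makes Matlab print each result, so each step can be inspected in the command window.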
Now we can return to our example at hand. To find all of the index values in our dataset that correspond to boys, we can use:
Now we can identify the heights of the boys and of the girls by accessing only those elements of heights:
Let's calculate the means and plot the data:
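As a minimal sketch of the selection and the means, assume a small made-up matrix in place of the lab's data file. The numbers and the variable names (heights, boys, girls, heights_boys, heights_girls) are assumptions for illustration; plotting is left out.

```matlab
% Hypothetical data (not the lab's file): column 1 is height in inches,
% column 2 is sex (1 for boy, 0 for girl)
heights = [52 1; 55 1; 58 1; 61 1; 64 1; 50 0; 53 0; 55 0; 57 0; 60 0];

boys  = find(heights(:,2) == 1);   % row index values for boys
girls = find(heights(:,2) == 0);   % row index values for girls

heights_boys  = heights(boys, 1);  % heights of the boys only
heights_girls = heights(girls, 1); % heights of the girls only

mean_boys  = mean(heights_boys)
mean_girls = mean(heights_girls)
```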
Q1: Are the differences in the means significant? We can check that out:
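One way to run this check is the 2-sample t test function ttest2, which is part of the Statistics Toolbox. The sketch below uses made-up heights rather than the lab's data, and the name p_t is chosen to match the questions later in this lab.

```matlab
% Hypothetical heights (an assumption, not the lab's data)
heights_boys  = [52 55 58 61 64]';
heights_girls = [50 53 55 57 60]';

% ttest2 requires the Statistics Toolbox.
% h is 1 if the test rejects equal means at the 5% level, 0 otherwise;
% p_t is the p value of the test.
[h, p_t] = ttest2(heights_boys, heights_girls);
```

With only 5 made-up values per group the test has little power, so the result for this toy data says nothing about the real dataset.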
Now, how much of the variation in this data can be explained by sex?
The total variation that we observed is the sum of squared deviations from the grand mean, which is simply the sample variance multiplied by the number of samples minus 1:
This total variance is the sum of the variance that occurs 1) within the 2 groups, and 2) between (or across or among) the 2 groups.
We can calculate the variance within each group (that is, for each sex) as the following:
So the total variance "within" groups is:
Finally, we can calculate the variance that exists between the group means. This is calculated as a weighted sum of the squared differences between each group's mean and the "grand mean" of all data points, with each group weighted by its number of samples:
The total variance is the sum of the variation within groups plus the variation between/across groups:
So we can calculate the percentage of variance that is "explained" by group membership, that is, how much variance is explained by the variance between groups:
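The whole decomposition can be sketched on a small example. The data and the variable names (SS_total, SS_within, SS_between) are assumptions for illustration, not the lab's own; each quantity is a sum of squared deviations.

```matlab
% Hypothetical data: column 1 is height, column 2 is sex (1 boy, 0 girl)
heights = [52 1; 55 1; 58 1; 61 1; 64 1; 50 0; 53 0; 55 0; 57 0; 60 0];
x  = heights(:,1);
hb = x(find(heights(:,2) == 1));   % boys' heights
hg = x(find(heights(:,2) == 0));   % girls' heights
grand_mean = mean(x);

% Total variation: squared deviations of every point from the grand mean
SS_total = sum((x - grand_mean).^2);

% Within-group variation: squared deviations from each group's own mean
SS_within = sum((hb - mean(hb)).^2) + sum((hg - mean(hg)).^2);

% Between-group variation: squared distance of each group mean from the
% grand mean, weighted by the number of samples in the group
SS_between = numel(hb)*(mean(hb) - grand_mean)^2 + ...
             numel(hg)*(mean(hg) - grand_mean)^2;

percent_explained = 100 * SS_between / SS_total
```

For these made-up numbers SS_within is 148 and SS_between is 22.5, which add up to SS_total of 170.5, so group membership explains about 13% of the variation.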
Q2: How much of the variation in this population is explained by sex? If you know the sex of an individual, do you have a lot of information about his/her height?
Q3: Is the unexplained variance a result of variation in the population, measurement noise, or both?
Analysis of variance (ANOVA)
In a neat trick, we can use the between group variance and the within group variance to independently estimate the variance of the "true" distribution:
If there are no significant differences across the means of the groups, then these 2 estimates should be approximately equal, and their ratio should be close to 1. The mathematician R. A. Fisher developed a test, called the F test in his honor, that gives the probability of observing a ratio of these 2 variances at least as large as the one measured if the group means were truly the same:
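A minimal sketch of the 2 variance estimates and their ratio, assuming made-up heights for 2 groups. Each sum of squares is divided by its degrees of freedom: the number of groups minus 1 for the between-group estimate, and the number of data points minus the number of groups for the within-group estimate. The variable names are illustrative.

```matlab
% Hypothetical heights (an assumption, not the lab's data)
heights_boys  = [52 55 58 61 64];
heights_girls = [50 53 55 57 60];
all_heights = [heights_boys heights_girls];

N = numel(all_heights);   % total number of data points
k = 2;                    % number of groups
grand_mean = mean(all_heights);

SS_within  = sum((heights_boys - mean(heights_boys)).^2) + ...
             sum((heights_girls - mean(heights_girls)).^2);
SS_between = numel(heights_boys)*(mean(heights_boys) - grand_mean)^2 + ...
             numel(heights_girls)*(mean(heights_girls) - grand_mean)^2;

var_between = SS_between / (k - 1);   % estimate from the group means
var_within  = SS_within  / (N - k);   % estimate from the spread inside groups

F = var_between / var_within          % near 1 when the group means do not differ
```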
Just like the other statistical tests we have seen in the class, we can calculate the value of F and compare it to a reference distribution with the same number of groups and individual data points to examine the probability that any difference is just due to sampling. This technique is called the Analysis of Variance, abbreviated ANOVA. Matlab can perform an ANOVA using the function anova1 (see help anova1).
Let's try Matlab's function:
Notice that the table that is generated includes the total variance, within group variance, and between group variances in the column 'SS' (which stands for "Sum of squares"). So, in the future, you can use Matlab's function for this purpose rather than writing out the formulas.
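A sketch of such a call on made-up data follows; anova1 requires the Statistics Toolbox. The second output is the ANOVA table as a cell array, and the 'off' option suppresses the table and box plot figures. The data and the names p_a and tbl are assumptions for illustration.

```matlab
% Hypothetical data: column 1 is height, column 2 is sex (1 boy, 0 girl)
heights = [52 1; 55 1; 58 1; 61 1; 64 1; 50 0; 53 0; 55 0; 57 0; 60 0];

% First input is the data, second input is the group label of each point
[p_a, tbl] = anova1(heights(:,1), heights(:,2), 'off');

% Row 1 of tbl holds the column headers; the 'SS' values are in column 2
SS_between = tbl{2,2};   % 'Groups' row
SS_within  = tbl{3,2};   % 'Error' row
SS_total   = tbl{4,2};   % 'Total' row
```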
Q4: Does the p value p_a indicate that the differences in means between the groups are significant? How does the value of p_a compare to the value returned from the t test p_t?
ANOVA and groups of 3 or larger
Okay, so why do we need a second test that apparently does the same thing as the t test? The reason is that the ANOVA can do even more than the t test. Suppose we want to evaluate whether the means of 3 or more groups are significantly different. We can't use the t test for this purpose. Why?
Suppose we sample the same distribution twice. If we perform a t test on the 2 samples, we expect it to return a p value of less than 0.05 about 5% of the time. We can simulate this as follows.
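One possible version of this simulation is sketched below. The number of simulations, the sample size, and the use of a standard normal distribution are assumptions; ttest2 requires the Statistics Toolbox.

```matlab
num_sims = 1000;   % number of simulated experiments (assumed)
n = 20;            % samples per group (assumed)
p = zeros(num_sims, 1);

for i = 1:num_sims
    a = randn(n, 1);   % first sample from a standard normal distribution
    b = randn(n, 1);   % second sample from the same distribution
    [h, p(i)] = ttest2(a, b);
end

% Fraction of simulations that look "significant" purely by chance
false_positive_rate = mean(p < 0.05)
```

Because the samples are random, the rate changes a little from run to run, but it should stay near 0.05.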
Now suppose we have 4 groups, and we want to know if the mean of any of these 4 groups is different from the mean of any other. One naive thing to do might be to do all combinations of t tests. Let's do this to show that it is the wrong thing to do. We will look for any evidence of a significant t test among the groups by using the logical operator OR, which is designated by a vertical bar (|):
Now let's try our simulation:
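A sketch of such a simulation, comparing the naive approach with the ANOVA, is given below. All 4 groups are drawn from the same distribution, so every "significant" result is a false alarm. The simulation size, the group size, and the variable names are assumptions; ttest2 and anova1 require the Statistics Toolbox.

```matlab
num_sims = 1000;                 % number of simulated experiments (assumed)
n = 20;                          % samples per group (assumed)
pairs = nchoosek(1:4, 2);        % all 6 pairs of the 4 groups
naive_sig = false(num_sims, 1);
anova_sig = false(num_sims, 1);

for i = 1:num_sims
    X = randn(n, 4);             % each column is one group, same distribution
    any_sig = false;
    for j = 1:size(pairs, 1)
        [h, p] = ttest2(X(:, pairs(j,1)), X(:, pairs(j,2)));
        any_sig = any_sig | (p < 0.05);   % OR together all pairwise tests
    end
    naive_sig(i) = any_sig;
    % anova1 treats each column of a matrix as a separate group
    anova_sig(i) = anova1(X, [], 'off') < 0.05;
end

naive_rate = mean(naive_sig)     % fraction "significant" by pairwise t tests
anova_rate = mean(anova_sig)     % fraction "significant" by ANOVA
```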
Q5: How many simulations were "significant" for the naive method of successive t-tests with an alpha of 5%? Was this percentage close to 5% or was it a lot greater? Why do you think this is? What fraction of simulations were "significant" for the ANOVA test? Said another way, how many chances did the data have to pass the 5% t-tests? How many chances should the data "deserve" to pass a 5% statistical test?
Assumptions of the ANOVA
Like the t test, the ANOVA assumes that all of the underlying variables are normally distributed. The ANOVA is probably still okay to use when the data are "normal-like"; that is, they have a strong central tendency, and they don't hit any hard limits (like 0 or some upper bound of an index).
If you know a person's height, can you guess whether they are a boy or a girl? How often would you be right? How often would you be wrong?
Let's look at the cumulative histograms for boys and girls.
If you were given someone's height, you would probably look at the cumulative histogram and try to make a guess as to whether the subject was a boy or a girl. If the height was less than 60 inches, you'd probably just guess randomly, since there is no difference between boys and girls for those heights; but, if the height was greater than 60, you'd probably guess boy, and you'd be right slightly more often than you were wrong.
What is the likelihood that you guess correctly overall? Let's look at the height 70. Only about 5% of individuals are taller than 70, so you wouldn't be asked about this very often; but, if you were, you'd guess boy, and you'd be right more often than not. From the cumulative histogram you can see that about 6.8% of boys are at least 70 inches tall (cases where a guess of boy is right), while only about 0.3% of girls are (cases where that guess is wrong). If we repeat this game many times for different heights, and put these percentages on the X and Y axis, respectively, we obtain what is called a discrimination curve (also known as a receiver operating characteristic, or ROC, curve). The function roc_analysis.m provided here computes this curve; we'll introduce the function and then study how it works.
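To make the construction concrete, here is a minimal sketch of how the points of such a curve could be computed by sweeping a threshold over the observed heights. This is an illustration on made-up data with made-up variable names; it is not the code of roc_analysis.m, which may differ in its details.

```matlab
% Hypothetical heights (an assumption, not the lab's data)
heights_boys  = [52 55 58 61 64];
heights_girls = [50 53 55 57 60];

% Candidate thresholds: every observed height, plus Inf so that the
% curve ends at the point (0, 0)
thresholds = unique([heights_boys heights_girls Inf]);

frac_boys  = zeros(size(thresholds));
frac_girls = zeros(size(thresholds));
for i = 1:numel(thresholds)
    % Fraction of each group at or above the threshold
    frac_boys(i)  = mean(heights_boys  >= thresholds(i));
    frac_girls(i) = mean(heights_girls >= thresholds(i));
end
```

Plotting one fraction against the other, for example with plot(frac_boys, frac_girls, 'o-'), traces the curve from (1, 1) at the lowest threshold to (0, 0) at the highest.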
The area under this curve indicates the percentage of time that you'd guess correctly if you played the game many many times. We can compute the accuracy of using each value to try to classify an unknown sample into one of these 2 distributions for each threshold. The answer is in the variable discrim:
In the left column, the function shows a possible dividing threshold (if the height is Xvalues(i) or greater then we say the sample is most likely to have arisen from distribution 2, girls in this case). In the right column is the fraction of the time you'd be able to guess correctly given the statistics of these 2 samples if you used that threshold.
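A table of this kind can be sketched by hand. The example below assumes the rule "guess boy if the height is at or above the threshold, otherwise guess girl" and uses made-up heights; the group labeling inside roc_analysis.m may be the other way around, and the variable discrim_sketch is a hypothetical stand-in for the function's output.

```matlab
% Hypothetical heights (an assumption, not the lab's data)
heights_boys  = [52 55 58 61 64];
heights_girls = [50 53 55 57 60];
N = numel(heights_boys) + numel(heights_girls);

thresholds = unique([heights_boys heights_girls Inf]);
frac_correct = zeros(size(thresholds));
for i = 1:numel(thresholds)
    % Correct guesses: boys at or above the threshold, girls below it
    num_right = sum(heights_boys >= thresholds(i)) + ...
                sum(heights_girls < thresholds(i));
    frac_correct(i) = num_right / N;
end

% Left column: threshold; right column: fraction guessed correctly
discrim_sketch = [thresholds' frac_correct']
best_fraction = max(frac_correct)
```

For these made-up numbers the best thresholds get 7 of the 10 children right.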
Q6: What fraction of the time would you expect to be able to correctly identify the sex of a child given his/her height, assuming you use the best dividing threshold?
Let's perform the integration under a very small curve ourselves to see how this is done. Let's calculate the width of each rectangle. We'll use the function diff, which takes the difference of sequential elements in a vector (see help diff):
Suppose the values on the X axis and Y axis were
then we could estimate the rectangle widths and heights by
Now we'll calculate the area under the curve by matrix multiplication:
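The whole calculation can be sketched with a tiny made-up curve. The values of X and Y are hypothetical, and taking each rectangle's height as the average of the 2 neighboring Y values is one reasonable choice (the trapezoid rule), assumed here for illustration.

```matlab
% Hypothetical points on a very small curve (an assumption)
X = [0 0.5 1];
Y = [0 0.8 1];

widths  = diff(X);                       % width of each rectangle
heights = (Y(1:end-1) + Y(2:end)) / 2;   % average of neighboring Y values

% A row vector times a column vector multiplies each width by its height
% and sums the products, which is the total area
area = widths * heights'
```

Here the 2 rectangles have areas 0.5*0.4 and 0.5*0.9, so the total area is 0.65.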
If you read the function roc_analysis, you'll notice this is exactly what is done.
Read the roc_analysis.m function; see if you can understand its steps.