Lab 1.7 Basic sample descriptors; Tinkering with Plots
In the last 2 labs, we learned the concepts underlying basic statistical tests and how to perform a few basic statistical procedures: how to test whether 2 samples are likely to be derived from the same true distribution (the Kolmogorov-Smirnov test), how to establish confidence intervals around the mean (the central limit theorem), and how to examine whether the means of 2 normal-ish distributions are the same or different (the two-sample t-test).
In today's lab, we will focus more deeply on understanding common descriptors that are used with sampled data. In the second half of the lab, we'll have some fun learning more about Matlab's plotting and graphics environment.
In Lab 1.5, we saw that the process of statistical inference involved sampling from a "true" distribution that we can never know exactly, and then making inferences about the true population based on the sample. Here, we will discuss some basic descriptors of distributions, and how they are related to one another.
Measures of variation / deviation
Just about any phenomenon that we would want to study exhibits variation. (If there were no variation, then there would be nothing to study, really.) It's important to recognize that there are many sources of variation.
Let's imagine a variable like "height of people in the world who are older than 25 years of age". Everyone in the world is not equally tall. Due to genetic and environmental factors both known and unknown to science, height varies from person to person.
One could imagine documenting this variation in a number of ways, some of which we've already seen. We've looked at percentiles, histograms, and, my favorite, the cumulative histogram, which allows one to read off the entire structure of a whole distribution by examining all percentiles of the data.
The most commonly reported single measures of variation are the variance and its companion measure, the standard deviation. The variance and standard deviation measure how the values of a distribution deviate from the mean.
Recall that the mean (represented by P with a bar over it, written P̄ below) is just the numerical average of all values of the population (enumerated by P1, P2, P3, ... PM, where M is the total number of members of the population):

P̄ = (P1 + P2 + P3 + ... + PM) / M

The variance is the average squared deviation of each point from the mean, and the standard deviation is just the square root of the variance:

variance = [ (P1 - P̄)² + (P2 - P̄)² + ... + (PM - P̄)² ] / M
standard deviation = sqrt(variance)
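To make these definitions concrete, here is a short sketch (using a made-up vector P) that computes the variance and standard deviation "by hand" and compares them with Matlab's built-in functions. Note that var and std divide by N-1 by default; passing a second argument of 1 requests division by M, as in the population formulas above.

```matlab
P = [1.5 1.6 1.7 1.8 1.9];     % a made-up "population" of 5 heights, in meters
M = length(P);
Pbar = sum(P)/M;               % the mean
v = sum((P - Pbar).^2)/M;      % variance: average squared deviation from the mean
sd = sqrt(v);                  % standard deviation: square root of the variance
v - var(P,1)                   % should be 0 (up to rounding)
sd - std(P,1)                  % should be 0 (up to rounding)
```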
Note that the variance has the units of the population value squared. So, in our example of heights of the world, the variance has the units of meters², whereas the standard deviation has the same units as the population value (in this case, meters).
You may ask yourself, why is the square of these deviations something useful to compute? Why not calculate the absolute value of the difference between each sample and the mean, or the cube...why do people do this? There are several reasons, but one is that many variables in nature and in mathematics are Normally distributed, and a Normally-distributed variable (that is, one that follows the Normal distribution we saw in Lab 1.6) can be described completely by its mean and variance/standard deviation. When we turn to fitting a few labs down the road, we'll see another advantage to using the squared difference.
Inferring the mean and standard deviation of the "true" distribution from a sample
Suppose we have a sample of real data points that we'll call S, with individual sample points S1, S2, ... SN; we can then calculate the mean and standard deviation directly from the sample data points. To indicate that the source of these values is different from the "true" population in the equations above, we'll give the analogous quantities different names.
There is the sample mean S̄ (S with a bar over it), the sample variance (s²), and the sample standard deviation (s):

S̄ = (S1 + S2 + ... + SN) / N
s² = [ (S1 - S̄)² + (S2 - S̄)² + ... + (SN - S̄)² ] / N
s = sqrt(s²)
But, most of the time we actually want to infer the mean and standard deviation of the true distribution from which the samples are derived. S̄ is once again the estimate of the mean of the true distribution, but, amazingly, the equation for the estimate of the "true" distribution's variance/standard deviation is slightly different: we divide by N-1 instead of N. Writing the estimates with hats (ŝ² and ŝ):

ŝ² = [ (S1 - S̄)² + (S2 - S̄)² + ... + (SN - S̄)² ] / (N - 1)
ŝ = sqrt(ŝ²)
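The difference is easy to see in Matlab: var(S) and std(S) divide by N-1 (the estimates of the true distribution's spread), while var(S,1) and std(S,1) divide by N (the sample quantities). A sketch with made-up sample data:

```matlab
S = [10 12 9 11 13 8];             % a made-up sample
N = length(S);
Sbar = mean(S);
s2    = sum((S - Sbar).^2)/N;      % sample variance (divide by N)
s2hat = sum((S - Sbar).^2)/(N-1);  % estimate of the true variance (divide by N-1)
var(S)    % divides by N-1; matches s2hat
var(S,1)  % divides by N; matches s2
```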
What is the reason for the difference? Why do we divide by N-1 instead of N to estimate the variance of the true distribution/population? The explanation is a little mathy (look up "Bessel's correction" if you're interested). The short version is that the sample mean S̄ is itself computed from the same data, so the data points are, on average, slightly closer to S̄ than they are to the unknown true mean; dividing by N-1 instead of N corrects for the resulting underestimate of the variance.
With the estimate ŝ of the true standard deviation in hand, we can estimate the standard error of the mean, which we saw in the last lab allows us to estimate confidence intervals around the sample mean (that is, it answers how much we expect the sample mean to deviate from the true population mean):

standard error of the mean: SE = ŝ / sqrt(N)
Recall that the procedure of sampling produces an estimate of the true mean that is normally distributed with standard deviation equal to the standard error of the mean. This means that there is about a 68% chance that the true mean is within SE of the sample mean, and about a 95% chance that the true mean is within 2*SE of the sample mean.
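Putting this together, here is a sketch of how one might compute the standard error and a rough 95% confidence interval for a made-up sample:

```matlab
S = 100 + 15*randn(1,50);          % a made-up sample of 50 measurements
N = length(S);
shat = std(S);                     % estimate of the true standard deviation (divides by N-1)
SE = shat/sqrt(N);                 % standard error of the mean
CI = [mean(S)-2*SE, mean(S)+2*SE]  % approximate 95% confidence interval for the true mean
```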
Here's a table summary of these quantities:

Quantity             True population (size M)   Sample (size N)    Estimate of true value from sample
Mean                 P̄                          S̄                  S̄
Variance             mean sq. deviation (÷ M)   s²  (÷ N)          ŝ²  (÷ N-1)
Standard deviation   sqrt of variance           s                  ŝ
[X1,Y1] = cumhist(mydata1,[-600 600],1);
[X2,Y2] = cumhist(mydata2,[-600 600],1);
For your study, do you want to report the full distribution or the mean and standard error?
Many papers report only the mean and standard error of their experiments. Let's look at how this is done, using the example chicken weights from animals that were fed normal corn or lysine-enriched corn (from the PS1_2.zip file).
If we wanted to report this data, we could plot the entire distribution:
or we could simply plot the mean and standard error (using the std function in Matlab, which computes the estimate of the true distribution standard deviation, ŝ, as above):
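The exact plotting commands aren't reproduced here, but a sketch of a mean-and-standard-error plot (assuming the normal-corn and lysine-corn weights are in vectors called normal and lysine; the numbers below are made up) might look like:

```matlab
normal = 300 + 50*randn(1,20);     % made-up stand-ins for the chicken-weight vectors
lysine = 340 + 50*randn(1,20);
mn = [mean(normal) mean(lysine)];
se = [std(normal)/sqrt(length(normal)) std(lysine)/sqrt(length(lysine))];
figure;
errorbar([1 2], mn, se, 'o');      % plot the means with +/- 1 standard error bars
set(gca,'XTick',[1 2],'XTickLabel',{'Normal corn','Lysine corn'});
ylabel('Weight (grams)');
```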
Q1: Which plot do you like better for this data? Which do you think tells you the most about the experiment?
Sources of variation / deviation
Variation in actual experiments has many sources. First, there is the variation in the underlying true distribution itself. Second, we are sampling, so there is some inherent uncertainty in our knowledge of the true distribution. Third, there may be noise in the measurements that we are able to make due to our instrumentation or other factors.
Suppose the measurements of chicken weights were very noisy, such that the measurements were Normally distributed with a standard deviation of 100 grams (wiggly chickens, for instance). We can simulate this situation by adding noise to the data above. The function randn generates pseudorandom noise (see help randn).
You can verify the mean and standard deviation of the noise produced by the randn function:
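For example, one can draw a large number of values from randn and check that they have mean near 0 and standard deviation near 1 (and that scaling by 100 scales the standard deviation accordingly):

```matlab
noise = randn(1,100000);           % 100,000 pseudorandom normally-distributed values
mean(noise)                        % should be close to 0
std(noise)                         % should be close to 1
std(100*noise)                     % should be close to 100
```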
In this simulation, we'd like to generate a set of random values that is the same size as our experimental data. To do this, we can use the function size (see help size).
Now we can create our simulated data:
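A sketch of this step (assuming the chicken weights are in a made-up vector called weights):

```matlab
weights = 300 + 50*randn(1,20);                      % made-up stand-in for the real weight data
noisy_weights = weights + 100*randn(size(weights));  % add measurement noise with SD = 100 grams
% size(weights) makes the matrix of noise values match the data in shape
```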
Q2: Does knowing that the level of pure measurement noise in this new sample is large have an impact on your opinion of which graph is more informative?
Q3: To what can we attribute variation in a sample? Of the things you mention, is it always possible to know how much each one contributes to the variation in the sample?
Tinkering with plots
In the last 2 labs and in the homework, we have created several plots. Matlab offers a lot of flexibility for customizing these plots. In this section, we will explore the data structures that underlie Matlab plots and show you how to edit their fields from the command line.
Please make sure you have the correct histbins.m file, and then let's generate some data for plotting:
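The original code for this step isn't reproduced here; a sketch that uses Matlab's built-in hist function in place of the course's histbins file (so it runs on its own) might be:

```matlab
mydata = 100*randn(1,500);               % made-up data for plotting
f = figure;                              % create a figure and save its handle in f
[counts,bincenters] = hist(mydata,20);   % bin the data into 20 bins
bar(bincenters,counts);                  % draw the histogram as a bar plot
```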
We can get the properties of the figure we just made using the get command:
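Assuming the figure's handle was saved in a variable f, one would type:

```matlab
get(f)                             % list all property name/value pairs of figure f
```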
You will see a number of property name and value pairs displayed on the screen. I get the following (you can skim these; no need to read them in depth, but do notice there are a lot of properties that have values):
'Figure' property fields
We can modify the parameters of these fields with the set command:
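For example, one could change the figure's background color to white (Color is a standard figure property that takes an RGB triplet):

```matlab
set(f,'Color',[1 1 1]);            % set the figure background to white
get(f,'Color')                     % verify the change
```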
The variable f is called a handle to the figure. Its value (a whole number for figures) is essentially a common reference number that the user and Matlab can use to refer to the figure. Each graphics object in Matlab, such as a figure or a set of plotting axes, has a unique handle number.
We can use the function gcf ("get current figure") to return the handle to the frontmost figure. This is a good way to obtain the handle for a figure if you don't know it (or if your program doesn't know it):
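```matlab
f = gcf;                           % handle of the frontmost (current) figure
```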
One can access the objects that are part of the figure by examining the figure's children field:
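```matlab
ch = get(f,'Children')             % handles of the objects contained in figure f
```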
In this case, the figure has 1 object, which corresponds to the plotting axes on the figure. We can also access the current plotting axes with the function gca ("get current axes"):
We can look at the properties of axes as follows:
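```matlab
ax = gca;                          % handle to the current plotting axes
get(ax)                            % list all of the axes' property name/value pairs
```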
On my system, I see a long list of properties (you can skim them, no need to read them in depth):
'Axes' property fields
We can adjust a number of properties of the axes using the handle.
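The exact four statements from the original aren't reproduced here; statements along these lines (all standard axes properties) would each produce a visible change:

```matlab
set(gca,'FontSize',16);            % larger tick-label font
set(gca,'Box','off');              % remove the box drawn around the axes
set(gca,'TickDir','out');          % make the tick marks point outward
set(gca,'YDir','reverse');         % flip the y axis so values increase downward
```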
Q4: What happened to the plot after you ran these 4 statements?
One sad fact of Matlab is that these properties and what they control are not very well documented. Most of what I have learned has come from trying different things and seeing what happens. Recently, Matlab added a feature to set that gives a little feedback on what values some properties can take. If you call set with a handle and a property name but no value, it returns a list of the possible values (but only for properties that take discrete values; for properties that take continuous values, you're out of luck). For example:
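For instance, asking about the Box property of the current axes (a property that takes only discrete values):

```matlab
set(gca,'Box')                     % displays the discrete values this property accepts
```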
Q5: What values can the axes property YDir take?
Our set of axes also has children; the children correspond to the items that are plotted on the axes. In this case, we currently only have the bar plot:
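```matlab
b = get(gca,'Children')            % handle(s) of the objects plotted in the axes (here, the bar plot)
```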
The properties of a bar plot are as follows on my system:
'Bar plot' property fields
We can play with these fields as well:
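For example, one could change the bars' colors (FaceColor and EdgeColor are standard bar-plot properties):

```matlab
b = get(gca,'Children');           % handle of the bar plot in the current axes
set(b(1),'FaceColor',[1 0 0]);     % make the bars red
set(b(1),'EdgeColor',[0 0 0]);     % give the bars black outlines
```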
The general idea of examining a graphic object's handle for its field values, and then examining its children for their field values, is very helpful for creating customized plots that highlight exactly what you want to show.
One can also delete graphics objects by passing their handles to the delete function (but be careful: if you give delete a string input like 'myfile', it will assume you are passing the name of a file you want to delete):
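For example, deleting the bar plot removes it from the axes while leaving the axes themselves in place:

```matlab
b = get(gca,'Children');           % handle of the bar plot
delete(b(1));                      % the bars disappear; the empty axes remain
```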
Multiple axes per figure
Matlab also has routines that help you to arrange more than one set of axes on a figure at a time. The easiest way to do this is with the subplot function. For example:
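The original example isn't reproduced here; a sketch of a 2-row by 2-column arrangement consistent with the description below:

```matlab
figure;
subplot(2,2,1); plot(0:10,(0:10).^2);          % top left
subplot(2,2,2); plot(0:0.1:10,sin(0:0.1:10));  % top right
subplot(2,2,3); bar(1:5,[2 4 3 5 1]);          % bottom left
subplot(2,2,4); hist(randn(1,100),10);         % bottom right
```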
subplot takes 3 input arguments: the number of rows of axes you want, the number of columns of axes you want, and the axes number it should create (numbered left to right, top to bottom; see help subplot). In this case, we defined a 2 row by 2 column matrix of axes, and plotted a graph in each one.
Q6: How many children do you think the figure has now?
Q7: Are they the axes that were created by the subplot calls?
Matlab functions and operators
std, var, randn, size, get, set, gcf, gca, delete, subplot
Copyright 2011-2012 Stephen D. Van Hooser, all rights reserved.