Labs‎ > ‎

Lab 1.7 Basic sample descriptors; Tinkering with Plots

Introduction

In the last 2 labs, we have learned the concepts underlying basic statistical tests, and learned how to perform a few basic statistical procedures: we learned to test whether 2 samples are likely to be derived from the same true distributions (Kolmogorov-Smirnov test), how to establish confidence intervals around the mean (central limit theorem), and how to examine whether the means of 2 normal-ish distributions are the same or different (T-test of 2 samples).

In today's lab, we will focus more deeply on understanding common descriptors that are used with sampled data.  In the second half of the lab, we'll have some fun learning more about Matlab's plotting and graphics environment.

In Lab 1.5, we saw that the process of statistical inference involved sampling from a "true" distribution that we can never know exactly, and then making inferences about the true population based on the sample.  Here, we will discuss some basic descriptors of distributions, and how they are related to one another.


Measures of variation / deviation

Just about any phenomenon that we would want to study exhibits variation.  (If there were no variation, then there would be nothing to study, really.)  It's important to recognize that there are many sources of variation.

Let's imagine a variable like "height of people in the world who are older than 25 years of age".  Everyone in the world is not equally tall. Due to genetic and environmental factors both known and unknown to science, height varies from person to person.

One could imagine documenting this variation in a number of ways, some of which we've already seen.  We've looked at percentiles, histograms, and, my favorite, the cumulative histogram, which allows one to read off the entire structure of a whole distribution by examining all percentiles of the data.

The most commonly reported single measures of variation are the variance and its companion measure, the standard deviation.  The variance and standard deviation measure of how each value of a distribution deviates from the mean. 

Recall that the mean (represented by P with a bar over it) is just the numerical average of all values of the population (enumerated by P1, P2, P3, ... PM, where M is the total number of members of the population).


The variance is the average squared deviation of each point from the mean, and the standard deviation is just the square root of the variance:


Note that the variance has the units of the population value squared.  So, in our example of heights of the world, the variance has the units of meters2, whereas the standard deviation has the same units as the population value (in this case, meters).

You may ask yourself, why is the square of these deviations something useful to compute?  Why not calculate the absolute value of the difference between each sample and the mean, or the cube...why do people do this?  There are several reasons, but one is that many variables in nature and in mathematics are Normally distributed, and a Normally-distributed variable (that is, one that follows the Normal distribution we saw in Lab 1.6) can be described completely by its mean and variance/standard deviation.  When we turn to fitting a few labs down the road, we'll see another advantage to using the squared difference.

Inferring the mean and standard deviation of the "true" distribution from a sample

Suppose we have a sample of real data points that we'll call S, and we'll call the individual sample points S1, S2, ... Sn. ; then it is entirely possible to calculate the mean and standard deviation from the sample datapoints. To indicate that the source of these values are different from the "true" population in the equations above, we'll calculate analogous quantities with different names.

There is the sample mean S bar (S with a bar over it), the sample variance (s2), and sample standard deviation (s):


But, most of the time we actually want to infer the mean and standard deviation of the true distribution from which the samples are derived.  S bar is once again the estimate of the mean of the true distribution, but, amazingly, the equation for the estimate of the "true" distribution variance/standard deviation is slightly different:



What is the reason for the difference?  Why do we divide by N-1 instead of N to estimate the variance of the true distribution/population?  The explanation is a little mathy, but if you're interested, it's right here.  The short short version is that there is some uncertainty in the "true" mean that is unaccounted for by dividing by N (that's not really an explanation).

With an estimate of the sample standard deviation in hand, we can estimate the standard error of the mean, which we saw in the last lab allows us to estimate confidence intervals around the sample mean (that is, it answers how much we expect the sample mean to deviate from the population mean):
Recall that the procedure of sampling produces an estimate of the true mean that is normally distributed with variance equal to the standard error of the mean.  This means that there is a 68% chance that the true mean is within SE of the sample mean, and a 95% chance that the true mean is within 2*SE of the sample mean.

Here's a table summary of these quantities:

Common statistical descriptors


 Parameter "True" distribution Sampled distribution
(Matlab)
 Estimate of the "true" distribution
(Matlab)
 Mean  µ
mean(S)
, mean(S)
Variance  σ2 s2, var(S,1)  ŝ2, var(S)
 Standard deviation σ s, std(S,1)  ŝ, std(S)

(*) The sample estimate of the mean has a standard deviation equal to ŝ/sqrt(N), where N is the number of data points in the sample.  This is called the standard error of the mean.
For your study, do you want to report the full distribution or the mean and standard error?
 

Many papers report only the mean and standard error of their experiments. Let's look at how this is done.  Let's use the example chicken weights from animals that were fed normal corn or lysin-enriched corn from the PS1_2.zip file.

chicken_weights = load('chickenweights_control_experimental.txt','-ascii');
chicken_control = chicken_weights(:,1); % grab the first column
chicken_exper = chicken_weights(:,2);  % grab the 2nd column
mn1 = mean(chicken_control);
mn2 = mean(chicken_exper);

If we wanted to report this data, we could plot the entire distribution:

figure;
bar(1,mn1,'w');
hold on;
bar(2,mn2,'w');
plot(1,chicken_control,'go');
plot(2,chicken_exper,'rs');
xlabel('Groups');
ylabel('Weight');
title('Chicken weights after normal feed (left) or enriched feed (right)');

or we could simply plot the mean and standard error (using the std function in Matlab, which computes the estimate of the true distribution standard deviation, ŝ, as above):

figure;
se_control = std(chicken_control)/sqrt(length(chicken_control));
se_enriched = std(chicken_exper)/sqrt(length(chicken_exper));
bar(1,mn1,'w');
hold on;
bar(2,mn2,'w');
plot([1 1],[mn1-se_control mn1+se_control],'k-');
plot([2 2],[mn2-se_enriched mn2+se_enriched],'k-');
xlabel('Groups');
ylabel('Weight');
title('Chicken weights after normal feed (left) or enriched feed (right)');

Q1:  Which plot do you like better for this data? Which do you think tells you the most about the experiment?

Sources of variation / deviation

Variation in actual experiments has many sources.  First, there is the variation in the underlying true distribution itself.  Second, we are sampling, so there is some inherent uncertainty in our knowledge of the true distribution.  Third, there may be noise in the measurements that we are able to make due to our instrumentation or other factors.

Suppose the measurements of chicken weights were very noisy, such that the measurements were Normally distributed with a standard deviation of 100 grams (wiggly chickens, for instance). We can simulate this situation by adding noise to the data above.  The function randn generates pseudorandom noise (see help randn).

randn(1,2) % 1x2 set of random values with mean 0 and standard deviation 1
5*randn(1,2) % 1x2 set of random values with mean 0 and standard deviation 5

You can verify the mean and standard deviation of the noise produced by the randn function:

std(5*randn(1000,1))   % standard deviation of 1000 random points
mean(5*randn(1000,1)) % mean of 1000 random points

In this simulation, we'd like to generate a set of random values that is the same size as our experimental data.  To do this, we can use the function size (see help size).

size(chicken_exper)

Now we can create our simulated data:

chicken_exper2 = chicken_exper + 50*randn(size(chicken_exper));
chicken_control2 = chicken_control + 50*randn(size(chicken_control));
mn1_2 = mean(chicken_control2);
mn2_2 = mean(chicken_exper2);
figure;
bar(1,mn1_2,'w');
hold on;
bar(2,mn2_2,'w');
plot(1,chicken_control2,'go');
plot(2,chicken_exper2,'rs');
xlabel('Groups');
ylabel('Weight');
title('Noisy weights after normal feed (left) or enriched feed (right)');

figure;
se_control2 = std(chicken_control2)/sqrt(length(chicken_control2));
se_enriched2 = std(chicken_exper2)/sqrt(length(chicken_exper2));
bar(1,mn1_2,'w');
hold on;
bar(2,mn2_2,'w');
plot([1 1],[mn1_2-se_control2 mn1_2+se_control2],'k-');
plot([2 2],[mn2_2-se_enriched2 mn2_2+se_enriched2],'k-');
xlabel('Groups');
ylabel('Weight');
title('Noisy weights after normal feed (left) or enriched feed (right)');

Q2: Does knowing that the level of pure measurement noise in this new sample is large have an impact on your opinion of which graph is more informative?

Q3: To what can we attribute variation in a sample?  Of the things you mention, is it always possible to know how much each one contributes to the variation in the sample?


Tinkering with plots

In the last 2 labs and in the homework, we have created several plots.  Matlab offers a lot of flexibility for customizing these plots.  In this section, we will explore the data structures that underlie Matlab plots and show you how to edit their fields from the command line.  

Please make sure you have the correct histbins.m file, and then let's generate some data for plotting:

mydata1 = 100*generate_random_data(100,'normal',1,1);
[N,bin_centers] = histbins(mydata1,[-500:100:500]);
f = figure;
bar(bin_centers,N);
xlabel('X'); 
ylabel('Y');

We can get the properties of the figure we just made using the get command:

get(f)

You will see a number of property name and value pairs displayed on the screen.  I get the following (you can skim these; no need to read them in depth, but do notice there are a lot of properties that have values):

'Figure' property fields

Alphamap = [ (1 by 64) double array]
CloseRequestFcn = closereq
Color = [0.8 0.8 0.8]
Colormap = [ (64 by 3) double array]
CurrentAxes = [341.002]
CurrentCharacter = 
CurrentObject = [341.002]
CurrentPoint = [482 395]
DockControls = on
FileName = 
IntegerHandle = on
InvertHardcopy = on
KeyPressFcn = 
KeyReleaseFcn = 
MenuBar = figure
Name = 
NextPlot = add
NumberTitle = on
PaperUnits = inches
PaperOrientation = portrait
PaperPosition = [0.25 2.5 8 6]
PaperPositionMode = manual
PaperSize = [8.5 11]
PaperType = usletter
Pointer = arrow
PointerShapeCData = [ (16 by 16) double array]
PointerShapeHotSpot = [1 1]
Position = [680 494 560 420]
Renderer = painters
RendererMode = auto
Resize = on
ResizeFcn = 
SelectionType = normal
ToolBar = auto
Units = pixels
WindowButtonDownFcn = 
WindowButtonMotionFcn = 
WindowButtonUpFcn = 
WindowKeyPressFcn = 
WindowKeyReleaseFcn = 
WindowScrollWheelFcn = 
WindowStyle = normal
XDisplay = /tmp/launch-iH0FBE/org.x:0
XVisual = 0x24 (TrueColor, depth 24, RGB mask 0xff0000 0xff00 0x00ff)
XVisualMode = auto

BeingDeleted = off
ButtonDownFcn = 
Children = [341.002]
Clipping = on
CreateFcn = 
DeleteFcn = 
BusyAction = queue
HandleVisibility = on
HitTest = on
Interruptible = on
Parent = [0]
Selected = off
SelectionHighlight = on
Tag = 
Type = figure
UIContextMenu = []
UserData = []
Visible = on


We can modify the parameters of these fields with the set command:

set(f,'Color',[1 0 0]); % changes the background color to red
set(f,'MenuBar','none'); % turns off the menu bar

Handles

The variable f is called a handle to the figure.  Its value, (a whole number for figures), is essentially a common reference number that the user and Matlab can use to refer to the figure.  Each graphics object in Matlab, like figures and sets of plotting axes, has a unique handle number.

We can use the function gcf ("get current figure") to return the handle to the frontmost figure.  This is a good way to obtain the handle for a figure if you don't know it (or if your program doesn't know it):

f = gcf

One can access the objects that are part of the figure by examining the figure's children field:

objects = get(f,'children')

In this case, the figure has 1 object, which corresponds to the plotting axes on the figure.  We can also access the current plotting axes with the function gca ("get current axes"):

ax = gca

We can look at the properties of axes as follows:

get(ax)

On my system, I see a long list of properties (you can skim them, no need to read them in depth):

'Axes' property fields

ActivePositionProperty = outerposition
ALim = [0 1]
ALimMode = auto
AmbientLightColor = [1 1 1]
Box = on
CameraPosition = [0 20 17.3205]
CameraPositionMode = auto
CameraTarget = [0 20 0]
CameraTargetMode = auto
CameraUpVector = [0 1 0]
CameraUpVectorMode = auto
CameraViewAngle = [6.60861]
CameraViewAngleMode = auto
CLim = [1 2]
CLimMode = auto
Color = [1 1 1]
CurrentPoint = [ (2 by 3) double array]
ColorOrder = [ (7 by 3) double array]
DataAspectRatio = [500 20 1]
DataAspectRatioMode = auto
DrawMode = normal
FontAngle = normal
FontName = Helvetica
FontSize = [10]
FontUnits = points
FontWeight = normal
GridLineStyle = :
Layer = bottom
LineStyleOrder = -
LineWidth = [0.5]
MinorGridLineStyle = :
NextPlot = replace
OuterPosition = [0 0 1 1]
PlotBoxAspectRatio = [1 1 1]
PlotBoxAspectRatioMode = auto
Projection = orthographic
Position = [0.13 0.11 0.775 0.815]
TickLength = [0.01 0.025]
TickDir = in
TickDirMode = auto
TightInset = [0.0767857 0.0904762 0.00357143 0.0190476]
Title = [347.002]
Units = normalized
View = [0 90]
XColor = [0 0 0]
XDir = normal
XGrid = off
XLabel = [345.002]
XAxisLocation = bottom
XLim = [-500 500]
XLimMode = auto
XMinorGrid = off
XMinorTick = off
XScale = linear
XTick = [-450 -350 -250 -150 -50 50 150 250 350 450]
XTickLabel = 
-450
-350
-250
-150
-50 
50  
150 
250 
350 
450 
XTickLabelMode = auto
XTickMode = manual
YColor = [0 0 0]
YDir = normal
YGrid = off
YLabel = [346.002]
YAxisLocation = left
YLim = [0 40]
YLimMode = auto
YMinorGrid = off
YMinorTick = off
YScale = linear
YTick = [0 5 10 15 20 25 30 35 40]
YTickLabel = 
10
15
20
25
30
35
40
YTickLabelMode = auto
YTickMode = auto
ZColor = [0 0 0]
ZDir = normal
ZGrid = off
ZLabel = [348.002]
ZLim = [-1 1]
ZLimMode = auto
ZMinorGrid = off
ZMinorTick = off
ZScale = linear
ZTick = [-1 0 1]
ZTickLabel = 
ZTickLabelMode = auto
ZTickMode = auto

BeingDeleted = off
ButtonDownFcn = 
Children = [342.005]
Clipping = on
CreateFcn = 
DeleteFcn = 
BusyAction = queue
HandleVisibility = on
HitTest = on
Interruptible = on
Parent = [2]
Selected = off
SelectionHighlight = on
Tag = 
Type = axes
UIContextMenu = []
UserData = []
Visible = on


We can adjust a number of properties of the axes using the handle.

axis([-500 500 0 50]);
set(ax,'YTick', [0 50]);
set(ax,'YDir','reverse');
set(ax,'YDir','normal');

Q4: What happened to the plot after you ran these 4 statements?

One sad fact of Matlab is that these properties and what they control are not very well documented.  Most of what I have learned has come from trying different things and seeing what happens.  Recently, Matlab has added a feature to set that gives a little feedback on what values some properties can take.  If you call set with no property value, it returns a list of possible values (but only for properties that can take discrete values; some properties take continuous values, and you're out of luck).  For example:

set(ax,'YDir')

Q5: What values can the axes property YDir take?

Our set of axes also has children; the children correspond to the items that are plotted on the axes.  In this case, we currently only have the bar plot:

ch = get(ax,'children')

get(ch)

The properties of a bar plot are as follows on my system:

'Bar plot' property fields

            Annotation: [1x1 hg.Annotation]
           DisplayName: ''
           HitTestArea: 'off'
          BeingDeleted: 'off'
         ButtonDownFcn: []
              Children: 343.0017
              Clipping: 'on'
             CreateFcn: []
             DeleteFcn: []
            BusyAction: 'queue'
      HandleVisibility: 'on'
               HitTest: 'on'
         Interruptible: 'on'
                Parent: 341.0017
    SelectionHighlight: 'on'
                   Tag: ''
                  Type: 'hggroup'
         UIContextMenu: []
              UserData: []
              Selected: 'off'
               Visible: 'on'
                 XData: [-450 -350 -250 -150 -50 50 150 250 350 450]
             XDataMode: 'manual'
           XDataSource: ''
                 YData: [0 0 0 2 13 40 31 9 4 1]
           YDataSource: ''
             BaseValue: 0
              BaseLine: 344.0017
             BarLayout: 'grouped'
              BarWidth: 0.8000
            Horizontal: 'off'
             LineWidth: 0.5000
             EdgeColor: [0 0 0]
             FaceColor: 'flat'
             LineStyle: '-'
          ShowBaseLine: 'on'


We can play with these fields as well:

mybar = ch; % so we don't have to keep referring to ch
set(mybar,'barwidth',1); % what happens?
set(mybar,'barwidth',0.5); 
set(mybar,'facecolor',[1 0 0]); % turn it red
set(mybar,'facecolor',[0 0 0]); % paint it black

The general idea of examining a graphic object's handle for its field values, and then examining its children for their field values, is very helpful for creating customized plots that highlight exactly what you want to show.

One can also delete handles using the delete function (but be careful; if you give delete a string input like 'myfile', it will assume you are passing a filename you want to delete):

delete(mybar);
delete(ax);
delete(f);

Multiple axes per figure

Matlab also has routines that help you to arrange more than one set of axes on a figure at a time. The easiest way to do this is with the subplot function. For example:

mydata1 = 100*generate_random_data(100,'normal',1,1);
mydata2 = 100*generate_random_data(100,'normal',1,1);
f = figure;
[X1,Y1] = cumhist(mydata1,[-600 600],1);
[X2,Y2] = cumhist(mydata2,[-600 600],1);
[N1,bin_centers] = histbins(mydata1,[-600:100:600]);
[N2,bin_centers] = histbins(mydata2,[-600:100:600]);
ax1=subplot(2,2,1);
plot(X1,Y1,'k-');
xlabel('Value');
ylabel('Fraction of data 1');
ax2=subplot(2,2,2);
plot(X2,Y2,'k-');
xlabel('Value');
ylabel('Fraction of data 2');
ax3=subplot(2,2,3);
bar(bin_centers,N1,'k');
xlabel('Value');
ylabel('Number of data 1 points');
ax4=subplot(2,2,4);
bar(bin_centers,N2,'k');
xlabel('Value');
ylabel('Number of data 2 points');

subplot takes 3 input arguments: the number of rows of axes you want, the number of columns of axes you want, and the axes number it should create (numbered left to right, top to bottom; see help subplot).  In this case, we defined a 2 row by 2 column matrix of axes, and plotted a graph in each one.

Q6: How many children do you think the figure has now?

ch = get(f,'children')

Q7: Are they the axes that were created by the subplot calls?

ax1,
ax2,
ax3,
ax4

Function reference

Matlab functions and operators

User functions

  • [no new functions]

Copyright 2011-2012 Stephen D. Van Hooser, all rights reserved.
Comments