Lab 1.3 More plotting and (dot) m-files

Introduction

In this lab, we will explore more plot types and learn how to write our own functions. For this lab, make sure you have downloaded the functions in Lab1.1 and have changed the current directory so that it is the Lab1_1 directory on your computer.

Tables for organization in code and mind

Let's revisit the way we generated data in the last lab:

% generates "bell curve" data with mean 1 and standard deviation 1

mydata1 = 100*generate_random_data(10,'normal',1,1); 

% generates "bell curve" data with mean 0 and standard deviation 1

mydata2 = 100*generate_random_data(10,'normal',0,1); 

When we created the array variables mydata1 and mydata2, we filled them with numbers, but we have to keep track of the meaning of these variables ourselves. Wouldn't it be nice if there were a way to organize these data a little bit more so we remember what they mean? There are infinitely many ways to do this, but one nice way is to encapsulate scientific observations within a data structure called a table.  A table is a bit like a matrix except it allows you to give titles to your columns and rows, and also allows you to store different types of data within the structure. Let's make a table.

% generates "bell curve" data with mean 1 and standard deviation 1

mydata1 = 100*generate_random_data(10,'normal',1,1); 

mytable1 = array2table(mydata1,'VariableNames',["Class 1"]);

% generates "bell curve" data with mean 0 and standard deviation 1

mydata2 = 100*generate_random_data(10,'normal',0,1); 

mytable2 = array2table(mydata2,'VariableNames',["Class 2"]);

Let's look at what we did:

mytable1,

mytable2,

Tables allow you to access data in a number of different ways. You can access:

mytable1(:,1), % select the first column

Or you can select the data by column name by calling the column name with parenthesis after a dot:

mytable1.("Class 1")

Tables will help us keep our data organized later on as things get more complicated. In the class, we will sometimes use tables and sometimes just use variables.

Adding a bar plot to our previous graphs

Let's revisit the graph that we made last time and add a bar graph using the Matlab command bar:

As in the last lab we can calculate the means:

mn1 = mean(mytable1.("Class 1"));

mn2 = mean(mytable2.("Class 2"));

Now let's add a bar graph to our plot of the raw data. The bar graph will be equal to the mean of each data set:

figure;

bar(1,mn1,'w');

hold on; % leave the current plot on the figure

bar(2,mn2,'w');

plot(1,mytable1.("Class 1"),'og')

plot(2,mytable2.("Class 2"),'sb')

axis([0 3 -300 500])

xlabel('Class number');

ylabel('Change in student happiness');

title('Change in student happiness in Class 1 vs. Class 2');

The bar graph we have just plotted isn't a big addition by itself, but bar graphs are used in generating histograms, which we will turn to next.

Histograms

Another way of representing data is using histograms.  In Lab 1.1 we learned that, in a histogram, the X-axis is divided into bins, and the Y-axis reports the number of observations that fall into that bin. Matlab has a few built-in histogram functions, the simplest of which is called hist:

figure;

hist(mytable1.("Class 1"),10); % creates a histogram with 10 bins

xlabel('Change in happiness after Class 1');

ylabel('Number of occurrences');

One nice thing about histograms is that they are quite adept at summarizing a large amount of data.  Suppose our study of happiness were expanded to 1000 subjects.

mydata1000 = 100*generate_random_data(1000,'normal',1,1);

myothertable = array2table(mydata1000,'VariableNames',["Class 1"]);

figure;

hist(myothertable.("Class 1"),100); % creates a histogram with 100 bins

xlabel('Change in happiness after Class 1');

ylabel('Number of occurrences');

Bins

With a histogram, one must always choose the number of bins to use.  Choosing an inappropriate number of bins won't allow you to display the data in a very meaningful way.  Consider:

figure;

hist(mytable1.("Class 1"),100); % creates a histogram with 100 bins

xlabel('Change in happiness after Class 1');

ylabel('Number of occurrences');

Q1: Do you get a sense of the underlying data from this histogram with 100 bins?  Choosing bins carefully is important with histograms.  You can also check out a few rules of thumb for choosing bin signs here.

In science, we typically want to have precise control over the number and size of the bins that are being used to plot our data.  Matlab allows one to have full control over the bins using a function called histc.  With histc, one specifies the edges of the bins.  Let's start with a simple example:

bin_edges = [ 0 1 2 3 4 5 ];

data = [ 0.5 0.5 1.5 1.5 3.5 3.5 3.7 4 4 ];

N = histc(data, bin_edges); % get the counts for each bin

N,  % look at the counts

bin_centers = bin_edges + 0.5; 

figure;

bar(bin_centers,N);

xlabel('Value');

ylabel('Number of occurrences');

Examine the bin_edges variable and the output of histc in this example, N.  In this example, we defined the histogram bin edges; the first bin will consist of all points greater or equal to 0 and less than 1; histc reports that the number of data points in this bin is 2. The function histc also returns an "extra" bin; in its last value it provides the number of data points that are equal to the last bin edge (in this case, numbers equal to 5, of which there are none).

When we plot, we want the center of the bars to be plotted at the center of each bin, so we compute the bin center locations (in bin_centers) in order to plot the data.

Q2: In the example above using histc, which bin was assigned the data points with value 4?  Why? (See help histc.)

The colon operator ':' for filling periodic arrays

Let's turn our attention to creating a custom histogram for the 1000-data point experiment contained in the variable mydata1000.  To do this, we'll need to create a very large array for the bins.  It would be a big pain to type out all the bin edges on the command line. Fortunately, there is an operator in Matlab that can help: the colon (:) operator.

Suppose we want to create a vector 1 2 3 4 5.  We could write

A = [ 1 2 3 4 5 ]

But we could also use the colon operator:

A = 1:5

This statement with the colon means "create a vector starting at the value 1, and increment each subsequent entry by 1 until you reach the value 5".

It is possible to use a step that is not 1 by using a pair of colons:

B1=5:-1:1 

B2=0:0.1:1

The colon operator is also useful for selecting subsets of an array.  Consider a small set of numbers that might be the bin edges of a histogram:

C1 = [ 0 1 2 3 7 8 9 10];

C2 = [ 0 1 2 3 4 5 6 7];

Note that the bin edges in C1 change width part way through, while the bins in C2 are constant. In order for our plot to look correct, we'd want to calculate the center location of these bins.

The first 2 bin edges in C1 are 0 and 1; that is, C1(1) is 0, and C1(2) is 1. Therefore, the center of this bin is 0.5:

C1(1)

C1(2)

(C1(1)+C1(2))/2

We'd like to use the colon operator to calculate this difference for each bin.  We can calculate the centers of these bins by calculating a bin-by-bin difference:

Look at

C1(1:end-1)

C1(2:end)

(C1(1:end-1)+C1(2:end))/2

(C2(1:end-1)+C2(2:end))/2

Now let's tackle our example of choosing bins for a large data set:

bin_edges = [ -1000:100:1000 ];

N = histc(mydata1000, bin_edges); % get the counts for each bin

bin_centers = (bin_edges(1:end-1)+bin_edges(2:end))/2; 

  % histc actually returns an extra bin in N containing values that

  % are exactly bin_edges(end); we don't want this so let's drop this bin

N = N(1:end-1);

figure;

bar(bin_centers,N);

xlabel('Change in happiness after Class 1');

ylabel('Number of occurrences');

Script M files (or, 'dot-M' files)

Up until now you've typed all of your commands directly into the command line.  Of course, for doing your homework, you'll probably want to put your code into a file so you can print it out easily and remember what you have done.  Let's make a 'dot-m' file for the last example.

In the Matlab command window, choose 'New...Script file' from the file menu.  Now you have an open text window.  Re-type the following lines into the window:

mydata1000 = 100*generate_random_data(1000,'normal',1,1);

bin_edges = [ -1000:100:1000 ];

N = histc(mydata1000, bin_edges); % get the counts for each bin

N = N(1:end-1);

bin_centers = (bin_edges(1:end-1)+bin_edges(2:end))/2;

figure;

bar(bin_centers,N);

xlabel('Change in happiness after Class 1');

ylabel('Number of occurrences');

Now choose Save as from the file menu.  Let's give this file the name mydata1000hist.m.  Save it into the current directory.

Now, from the command line, run the M-file by calling its name (without the .m):

mydata1000hist

Function M files

There are 2 flavors of M-files: scripts and functions.  The M-file we just wrote above is a script. It is essentially a bunch of lines that you could just type on the command line, and the purpose of saving them in a file is for convenience (to save yourself typing, or to make sure you do the same thing over and over again).

However, sometimes (in fact, most of the time) one wants to be able to run a script over and over again with different inputs to examine the outputs. In the last lab, we learned about functions:

Functions

Functions have a name, and accept inputs called input arguments that are enclosed in parentheses and are separated by commas. Functions return outputs called output arguments; the functions we have seen up to now return only 1 output, so we have defined a single variable to be equal to the output, but in principle there could be more:

[output1, output2, ...] = function_name(input1, input2, ...)

Let's create an example function for computing the bin centers and counts that we need in order to plot a histogram with custom bins.

From the file menu, choose "New ...Function file"

Example function histbins

Type the following in the window:

histbins.m

function [N, bin_centers] = histbins(data, bin_edges)

% HISTBINS - Generate counts and bin centers for custom histogram

%  [N, BIN_CENTERS] = HISTBINS(DATA, BIN_EDGES)

%

%   Inputs: DATA - 1-dimensional data observations

%           BIN_EDGES - The edges of the bins

%   Outputs: N - A vector of counts for each bin

%            BIN_CENTERS - The bin centers

N = histc(data, bin_edges);

N = N(1:end-1);

bin_centers = 0.5*(bin_edges(2:end) + bin_edges(1:end-1));

Now save this file as "histbins" (Matlab should add the ".m").  Let's try to run it:

[N,bin_centers] = histbins(mydata1000, -1000:100:1000);

figure;

bar(bin_centers,N);

xlabel('Change in happiness after Class 1');

ylabel('Number of occurrences');

All of the commented text above the first lines of code constitutes the user help for the function.  Try reading it:

help histbins

More about functions

Functions allow the user to write small pieces of code that are general and can be used over and over again 1) to save work, and 2) to ensure that the task is done consistently each time.  As one gains experience, one learns that the act of breaking down a task into smaller functions helps you to understand which elements in a system need to interact, and which are independent.

Writing a function also allows you to focus on the smaller problem that is the function's task, rather than the system as a whole. When Matlab is executing a function, the function only knows about its internal variables, which are called local variables.  

When we called histbins using [N,bin_centers]=histbins(mydata1000,-1000:100,1000), the local variable of histbins that is called data (the first input argument) was told to take the value of the main workspace variable called mydata1000, and the local variable bin_edges (the second input argument) was set to have the value -1000:100:1000.  However, histbins doesn't know anything else about the variables on the main workspace; it cannot see or access the name mydata1000, it cannot see or access A, or C1, or mydata1.  By the same token, the main workspace cannot access any local variables inside histbins except the output arguments N and bin_centers.  

Writing a function is like being in a quiet place. One only has to focus on turning the inputs into the outputs, and nothing more.

Cumulative histograms

The cumulative histogram is an alternative to the histogram for plotting an entire distribution. Cumulative histograms (sometimes called cumulative density histogram) indicate the fraction of data that is less than value X, and range (in Y value) from 0 to 1 (or 0% to 100%).  Cumulative histograms offer several advantages to regular histograms:

Mathematically, the cumulative histogram is the "integral" (area under the curve) of a traditional histogram, and the traditional histogram is the derivative (rate of change) of the cumulative histogram.  

Let's plot a cumulative histogram before writing a function to make it easier for us.  To make a cumulative histogram, we want to start on the X axis at a value that is lower than any value in the data, and slowly move to the right on the X axis, indicating what fraction of the data is less than that value. To do this, we can sort the data from least to greatest (performed by the function unique), and make the percentage go up for each data point. We will make the data go "up" by computing the cumulative sum of 

bin_edges = [unique([mydata1])];

bin_counts = histc(mydata1, bin_edges);

X = [bin_edges bin_edges]'; % two points at each X location

counts = 100*cumsum(bin_counts) / sum(bin_counts);

Y = [[0;counts(1:end-1)] [counts]]';

X = X(:); % put the points into single columns

Y = Y(:); % put the points into single columns

figure;

plot(X,Y,'k-'); % plot a black line

ylabel('Fraction of participants');

xlabel('Change in happiness after Class 1');

Two new functions were snuck into the above code: min(X) returns the value of the minimum element of the vector X, and max(X) returns the maximum element of the vector X.

An M file for cumulative histograms

Now let's write a function M-file for the cumulative histogram called cumhist (analogous to Matlab's function cumsum).  Choose "New...Function file" from the File menu.

cumhist.m

function [X,Y] = cumhist(data)

% CUMHIST - Make data for a cumulative histogram plot

%

%   [X,Y]=CUMHIST(DATA)

%

%    Generates X and Y variables suitable for plotting with

%    the plot function. Y values are generated for percentages

%    ranging from 0 to 100% in steps of PERCENT_DELTA. The XRANGE

%    for the plot is specified in XRANGE = [lowvalue highvalue].

%    

%    Example:

%

%      r=randn(1000,1); % 1000 normally-distributed random numbers

%

%      [X,Y] = cumhist(r);

%      figure;

%      plot(X,Y,'k-');

%      hold on;

%      plot([-4 4],[50 50],'k--'); % add 'median' (50%-tile) dashed line

%      box off;

%      ylabel('Percentage');

%      xlabel('Values');

%

%    See also: PRCTILE, CUMSUM, HIST    


data = data(:); % make it a column

bin_edges = [unique([data])];

bin_counts = histc(data, bin_edges);

X = [bin_edges bin_edges]'; % two points at each X location

counts = 100 * cumsum(bin_counts) / sum(bin_counts);

Y = [[0;counts(1:end-1)] [counts]]';

X = X(:); % put the points into single columns

Y = Y(:); % put the points into single columns

Now let's employ this function to compare our 2 data sets:

[X1,Y1] = cumhist(mydata1);

[X2,Y2] = cumhist(mydata2);

figure;

plot(X1,Y1,'r-');

hold on;

plot(X2,Y2,'b-');

axis([-500 500 0 100]); % set X axis to be from -500 to 500, Y axis from 0 to 100

xlabel('Happiness after class');

ylabel('Fraction of students');

legend('Class 1','Class 2'); % new function! see help legend

Function reference

Matlab functions and operators

User functions

Copyright 2011-2012 Stephen D. Van Hooser, all rights reserved.