Lab 1.3 More plotting and (dot) m-files
In this lab, we will explore more plot types and learn how to write our own functions. For this lab, make sure you have downloaded the functions in Lab1.1 and have changed the current directory so that it is the Lab1_1 directory on your computer.
Adding a bar plot to our previous graphs
Let's revisit the graph that we made last time and add a bar graph using the Matlab command bar:
As in the last lab we can calculate the means:
Now let's add a bar graph to our plot of the raw data. The height of each bar will be equal to the mean of the corresponding data set:
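The data from Lab 1.1 are not reproduced in this section, so the following is only a minimal sketch with made-up values. The variables mydata1 and mydata2 are hypothetical stand-ins for the two data sets, and the marker and color choices are assumptions.

```matlab
% Hypothetical stand-ins for the two data sets from Lab 1.1 (made-up values)
mydata1 = [3 4 5 4 6 5];
mydata2 = [6 7 8 7 9 8];

% Mean of each data set
mn1 = mean(mydata1);
mn2 = mean(mydata2);

figure;
% White bars at x = 1 and x = 2 whose heights are the two means
bar([1 2], [mn1 mn2], 'w');
hold on;
% Raw data points drawn on top of the bars
plot(ones(size(mydata1)), mydata1, 'ko');
plot(2 * ones(size(mydata2)), mydata2, 'ks');
hold off;
xlabel('Data set');
ylabel('Value');
```

Calling hold on before plotting the raw data keeps the bars on the axes instead of replacing them.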
The bar graph we have just plotted isn't a big addition by itself, but bar graphs are used in generating histograms, which we will turn to next.
Another way of representing data is using histograms. In Lab 1.1 we learned that, in a histogram, the X-axis is divided into bins, and the Y-axis reports the number of observations that fall into that bin. Matlab has a few built-in histogram functions, the simplest of which is called hist:
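As a minimal sketch of how hist can be called (the data values below are made up for illustration and are not the Lab 1.1 data):

```matlab
% Hypothetical data: 20 made-up scores
mydata = [2 3 3 4 4 4 5 5 5 5 6 6 6 7 7 8 4 5 6 5];

figure;
hist(mydata);                  % plots a histogram with the default 10 bins

% With output arguments, hist returns the counts and bin centers
% instead of drawing a plot
[n, centers] = hist(mydata);
```

The counts in n always add up to the number of data points, which is a quick sanity check on any histogram.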
One nice thing about histograms is that they are quite adept at summarizing a large amount of data. Suppose our study of happiness were expanded to 1000 subjects.
With a histogram, one must always choose the number of bins to use. Choosing an inappropriate number of bins won't allow you to display the data in a very meaningful way. Consider:
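The effect of the bin count can be explored with a sketch like the one below. The 1000-point data set here is randomly generated as a hypothetical stand-in for mydata1000, and the three bin counts are arbitrary choices for comparison.

```matlab
% Hypothetical stand-in for the 1000-subject data set (random, made-up values)
mydata1000 = 250 * randn(1, 1000);

figure;
subplot(1, 3, 1); hist(mydata1000, 3);   title('3 bins');
subplot(1, 3, 2); hist(mydata1000, 20);  title('20 bins');
subplot(1, 3, 3); hist(mydata1000, 100); title('100 bins');

% The second argument to hist sets the number of bins;
% with outputs, hist returns the counts without plotting
[n100, c100] = hist(mydata1000, 100);
```

Too few bins hide the shape of the distribution, while too many bins leave only a handful of points in each bar.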
Q1: Do you get a sense of the underlying data from this histogram with 100 bins? Choosing bins carefully is important with histograms. There are also a few rules of thumb for choosing bin sizes.
In science, we typically want to have precise control over the number and size of the bins that are being used to plot our data. Matlab allows one to have full control over the bins using a function called histc. With histc, one specifies the edges of the bins. Let's start with a simple example:
Examine the bin_edges variable and the output of histc in this example, N. In this example, we defined the histogram bin edges; the first bin will consist of all points greater than or equal to 0 and less than 1; histc reports that the number of data points in this bin is 2. Oddly, histc also returns an "extra" bin; its last value gives the number of data points that are exactly equal to the last bin edge (in this case, numbers equal to 5, of which there are none). Data points that fall outside the range of the bin edges are not counted at all.
When we plot, we want the center of the bars to be plotted at the center of each bin, so we compute the bin center locations (in bin_centers) in order to plot the data.
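The listing for this example is not shown in this section. A minimal sketch consistent with the description above might look like the following; the data values are made up, chosen so that 2 points fall in the first bin and none equal 5. Note that histc is an older function, and recent MATLAB releases recommend histcounts instead.

```matlab
% Hypothetical data, chosen to match the description in the text
data = [0.2 0.7 1.5 2.1 2.9 3.3 4 4.5];

% Edges of the bins: [0,1), [1,2), [2,3), [3,4), [4,5), plus "equal to 5"
bin_edges = [0 1 2 3 4 5];

% N has one entry per edge; the last entry counts values equal to 5
N = histc(data, bin_edges);

% The center of each of the five real bins
bin_centers = [0.5 1.5 2.5 3.5 4.5];

figure;
bar(bin_centers, N(1:end-1));   % leave out the "extra" last entry of N
xlabel('Value');
ylabel('Number of data points');
```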
Q2: In the example above using histc, which bin was assigned the data points with value 4? Why? (See help histc.)
The colon operator ':' for filling periodic arrays
Let's turn our attention to creating a custom histogram for the 1000-data point experiment contained in the variable mydata1000. To do this, we'll need to create a very large array for the bins. It would be a big pain to type out all the bin edges on the command line. Fortunately, there is an operator in Matlab that can help: the colon (:) operator.
Suppose we want to create a vector 1 2 3 4 5. We could write
But we could also use the colon operator:
This statement with the colon means "create a vector starting at the value 1, and increment each subsequent entry by 1 until you reach the value 5".
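The two ways of writing the vector can be sketched side by side (the variable names A and B are chosen here for illustration):

```matlab
A = [1 2 3 4 5];   % typed out by hand
B = 1:5;           % the same vector built with the colon operator

same = isequal(A, B);   % true: both vectors hold identical values
```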
It is possible to use a step that is not 1 by using a pair of colons:
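A few illustrative cases of the start:step:stop form (the variable names and values are chosen here as examples):

```matlab
C = 0:2:10;      % start at 0, step by 2, stop at 10:   [0 2 4 6 8 10]
D = 10:-2.5:0;   % steps may be negative or fractional: [10 7.5 5 2.5 0]
E = 0:3:10;      % stops at the last value not past 10: [0 3 6 9]
```

The final value is included only if the steps land on it exactly, as the third case shows.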
The colon operator is also useful for selecting subsets of an array. Consider a small set of numbers that might be the bin edges of a histogram:
Note that the bins in C1 change width partway through, while the bins in C2 all have the same width. In order for our plot to look correct, we'd want to calculate the center location of these bins.
The first 2 bin edges in C1 are 0 and 1; that is, C1(1) is 0, and C1(2) is 1. Therefore, the center of this bin is 0.5:
We'd like to use the colon operator to do this calculation for every bin at once. We can calculate the centers of all of the bins by pairing each bin's lower edge with its upper edge, bin by bin:
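The original values of C1 and C2 are not shown in this section, so the sketch below uses hypothetical edges that fit the description (C1 starts 0, 1 and changes width partway through; C2 has constant width).

```matlab
% Hypothetical bin edges (made-up values)
C1 = [0 1 2 3 5 7 9];
C2 = 0:2:10;

first_center = (C1(1) + C1(2)) / 2;    % center of the first bin: 0.5

lower1 = C1(1:end-1);                  % lower edge of every bin in C1
upper1 = C1(2:end);                    % upper edge of every bin in C1
C1_centers = (lower1 + upper1) / 2;    % [0.5 1.5 2.5 4 6 8]

C2_centers = (C2(1:end-1) + C2(2:end)) / 2;   % [1 3 5 7 9]

% Equivalent form: lower edge plus half of the bin-by-bin difference (width)
C1_centers_alt = C1(1:end-1) + (C1(2:end) - C1(1:end-1)) / 2;
```

Inside an index, the keyword end stands for the last element, so 1:end-1 selects everything but the last edge and 2:end selects everything but the first.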
Now let's tackle our example of choosing bins for a large data set:
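A minimal sketch of this step follows. The data set is randomly generated as a hypothetical stand-in for mydata1000, and the bin edges -1000:100:1000 are the ones mentioned later in this lab.

```matlab
% Hypothetical stand-in for the 1000-point data set (random, made-up values)
mydata1000 = 250 * randn(1, 1000);

bin_edges = -1000:100:1000;          % 21 edges, so 20 bins of width 100
N = histc(mydata1000, bin_edges);    % last entry counts values equal to 1000

% Center of each bin: -950, -850, ..., 950
bin_centers = (bin_edges(1:end-1) + bin_edges(2:end)) / 2;

figure;
bar(bin_centers, N(1:end-1));        % leave out the "extra" last entry of N
xlabel('Value');
ylabel('Number of observations');
```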
Script M files (or, 'dot-M' files)
Up until now you've typed all of your commands directly into the command line. Of course, for doing your homework, you'll probably want to put your code into a file so you can print it out easily and remember what you have done. Let's make a 'dot-m' file for the last example.
In the Matlab command window, choose 'New...Script file' from the file menu. Now you have an open text window. Re-type the following lines into the window:
Now choose Save as from the file menu. Let's give this file the name mydata1000hist.m. Save it into the current directory.
Now, from the command line, run the M-file by calling its name (without the .m):
Function M files
There are 2 flavors of M-files: scripts and functions. The M-file we just wrote above is a script. It is essentially a bunch of lines that you could just type on the command line, and the purpose of saving them in a file is for convenience (to save yourself typing, or to make sure you do the same thing over and over again).
However, sometimes (in fact, most of the time) one wants to be able to run a script over and over again with different inputs to examine the outputs. In the last lab, we learned about functions:
Functions have a name, and accept inputs called input arguments that are enclosed in parentheses and are separated by commas. Functions return outputs called output arguments; the functions we have seen up to now return only 1 output, so we have defined a single variable to be equal to the output, but in principle there could be more:
[output1, output2, ...] = function_name(input1, input2, ...)
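The built-in function max is one example that can return either one or two outputs; the vector v below is made up for illustration.

```matlab
v = [3 9 4];

m = max(v);           % one output: the largest value, 9

[m2, idx] = max(v);   % two outputs: the largest value, 9, and its position, 2
```

The square brackets on the left-hand side are what tell Matlab to collect more than one output argument.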
Let's create an example function for computing the bin centers and counts that we need in order to plot a histogram with custom bins.
From the file menu, choose "New ...Function file"
Example function histbins
Type the following in the window:
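The original listing for histbins is not shown in this section. The following is a minimal sketch of what such a function might look like, assuming the two-input, two-output form used later in this lab; dropping histc's extra last entry is a design choice of this sketch, not necessarily the original behavior.

```matlab
function [N, bin_centers] = histbins(data, bin_edges)
% HISTBINS - Count data points in custom bins and compute bin centers
%
%   [N, BIN_CENTERS] = HISTBINS(DATA, BIN_EDGES)
%
%   Inputs:  DATA        - vector of data values
%            BIN_EDGES   - vector of increasing bin edges
%   Outputs: N           - number of points in each bin; bin i holds values
%                          with BIN_EDGES(i) <= value < BIN_EDGES(i+1)
%            BIN_CENTERS - center location of each bin
%
%   Example:
%      [N, bin_centers] = histbins(mydata1000, -1000:100:1000);
%      bar(bin_centers, N);

N = histc(data, bin_edges);

% histc adds a final entry for values exactly equal to the last edge;
% this sketch drops it so that N and bin_centers have the same length
N = N(1:end-1);

bin_centers = (bin_edges(1:end-1) + bin_edges(2:end)) / 2;
```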
Now save this file as "histbins" (Matlab should add the ".m"). Let's try to run it:
All of the commented text above the first lines of code constitutes the user help for the function. Try reading it:
More about functions
Functions allow the user to write small pieces of code that are general and can be used over and over again 1) to save work, and 2) to ensure that the task is done consistently each time. As you gain experience, you will find that breaking a task down into smaller functions helps you to understand which elements in a system need to interact, and which are independent.
Writing a function also allows you to focus on the smaller problem that is the function's task, rather than the system as a whole. When Matlab is executing a function, the function only knows about its internal variables, which are called local variables.
When we called histbins using [N,bin_centers]=histbins(mydata1000,-1000:100:1000), the local variable of histbins that is called data (the first input argument) was told to take the value of the main workspace variable called mydata1000, and the local variable bin_edges (the second input argument) was set to have the value -1000:100:1000. However, histbins doesn't know anything else about the variables in the main workspace; it cannot see or access the name mydata1000, and it cannot see or access A, or C1, or mydata1. By the same token, the main workspace cannot access any local variables inside histbins; it receives only the values of the output arguments N and bin_centers.
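The separation between local variables and the main workspace can be illustrated with a small function file. The function scope_demo below is hypothetical and exists only for this illustration; it would be saved as scope_demo.m.

```matlab
function out = scope_demo(x)
% SCOPE_DEMO - Hypothetical function illustrating local variables
%
%   OUT = SCOPE_DEMO(X) returns 2*X.
%
%   Try at the command line:
%      temp = 100;
%      y = scope_demo(3);   % y is 6
%      temp                 % still 100; the function's temp is separate

temp = 2 * x;   % local variable; unrelated to any temp in the main workspace
out = temp;
```

Even though both the command line and the function use the name temp, the two variables never interact; only the value of out is passed back.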
Writing a function is like being in a quiet place. One only has to focus on turning the inputs into the outputs, and nothing more.
The cumulative histogram is an alternative to the histogram for plotting an entire distribution. Cumulative histograms (sometimes called cumulative density histograms) indicate the fraction of data that is less than a value X, and range (in Y value) from 0 to 1 (or 0% to 100%). Cumulative histograms offer several advantages over regular histograms:
One does not need to choose a number of bins; the data can be represented continuously
One can easily read out the median and percentile ranges of the data (the median is the 50th percentile point, for example).
One can plot 2 or more cumulative histograms on the same axes to compare data obtained under different conditions.
Mathematically, the cumulative histogram is the "integral" (area under the curve) of a traditional histogram, and the traditional histogram is the derivative (rate of change) of the cumulative histogram.
Let's plot a simple cumulative histogram before writing a function to make it easier for us. The cumulative histogram is really just a continuous representation of the percentile function, and we can use that to produce the cumulative histogram:
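A minimal sketch of this idea follows, using randomly generated data as a hypothetical stand-in. It assumes the function prctile is available; depending on the MATLAB release, prctile may require the Statistics and Machine Learning Toolbox.

```matlab
% Hypothetical data (random, made-up values)
mydata = 250 * randn(1, 1000);

pct = 0:0.1:100;            % percentiles from 0% to 100%
X = prctile(mydata, pct);   % data value found at each percentile

figure;
plot(X, pct, 'k-');
% Span the X-axis from the smallest to the largest data value
axis([min(mydata) max(mydata) 0 100]);
xlabel('Value');
ylabel('Percent of data');
```

Reading across from 50 on the Y-axis to the curve and then down to the X-axis gives the median.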
Two new functions were snuck into the above code: min(X) returns the value of the minimum element of the vector X, and max(X) returns the maximum element of the vector X.
An M file for cumulative histograms
Now let's write a function M-file for the cumulative histogram called cumhist (analogous to Matlab's function cumsum). Choose "New...Function file" from the File menu.
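The original listing for cumhist is not shown in this section. The sketch below is one possible version; its signature, including the pct_step input, is an assumption made for illustration, and it relies on prctile being available.

```matlab
function [X, Y] = cumhist(data, pct_step)
% CUMHIST - Cumulative histogram of a data set
%
%   [X, Y] = CUMHIST(DATA, PCT_STEP)
%
%   Inputs:  DATA     - vector of data values
%            PCT_STEP - spacing between percentiles (for example, 0.1)
%   Outputs: X        - data value at each percentile
%            Y        - percentiles, running from 0 up to 100
%
%   Example:
%      [X, Y] = cumhist(mydata1000, 0.1);
%      plot(X, Y);

Y = 0:pct_step:100;     % percentile values to evaluate
X = prctile(data, Y);   % data value at each of those percentiles
```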
Now let's employ this function to compare our 2 data sets:
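As a self-contained sketch of such a comparison, the code below calls prctile directly rather than a saved function file, and uses two made-up data sets as hypothetical stand-ins. With a cumhist function on the path, the two prctile lines could be replaced by calls to it.

```matlab
% Hypothetical data sets (made-up values)
mydata1 = [3 4 5 4 6 5 4 5 3 6];
mydata2 = [6 7 8 7 9 8 7 8 6 9];

pct = 0:1:100;                 % percentiles from 0% to 100%
X1 = prctile(mydata1, pct);    % cumulative histogram of the first data set
X2 = prctile(mydata2, pct);    % cumulative histogram of the second data set

figure;
plot(X1, pct, 'b-');
hold on;
plot(X2, pct, 'r-');
hold off;
xlabel('Value');
ylabel('Percent of data');
legend('mydata1', 'mydata2', 'Location', 'southeast');
```

A curve that sits farther to the right indicates a data set whose values are larger overall.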