In this lab, we will focus on organizing our workspace and files so that we can work efficiently.
In the second half of the lab, we will turn our attention back to samples and distributions, and specifically how we can perform mathematical operations on samples. Efficient Mfile managementIn the previous labs and in the homework, we created several Mfiles, including some scripts and some functions. In all of these labs, we have simply changed the current directory so that it corresponded to the directory where our Mfiles were located. However, imagine if one kept adding Mfiles to the same directory. After a few months of work, the directory would be full of dozens of Mfiles that may or may not be related to one another. So it is helpful to organize the files and to tell Matlab to look in more than one directory to find its Mfiles.
There is no one right or wrong way to curate your Matlab files, but learning to do it in some way will dramatically improve your efficiency as the number of projects (and labs and homeworks) you are working on increases. We will spend some time manually organizing our files to see how this is done. A suggested file organization schemeWe need to pick a directory that you will use for your Matlab files. Let's choose your UNET home directory, since it is accessible from the lab as well as anywhere in the world that you might be working.
Mounting your UNET home directoryTo mount your UNET home directory on a Mac, go to the program "Finder", go to the Go menu, and choose Connect to Server.... Enter smb://unethome.brandeis.edu/username (where username is your UNET username). Press Connect, and enter your password.
To mount your UNET home directory on a PC, click on Start and choose Run. Then type \\unethome.brandeis.edu\username (where username is your UNET username). Enter your password when prompted.
Now let's create a directory called MATLAB in your UNET folder (using either Windows or the Finder on a Mac).
Using either the Macintosh Finder or Windows, make 2 new directories within the MATLAB folder, 1 called data, 1 called tools. Within the tools directory, make a new folder called plotting, a new folder called stats, and a new folder called scripts. Within the scripts directory, make a new folder called Labs, and another called Homework.
The files that we have created can be downloaded by clicking on the links in the directory tree below. Download these files, and put them in the appropriate folders. (Note: the files change_mineral_...txt are data files we will examine today.)
Now your directory structure should look something like this:
[your UNET home]/MATLAB/
Adding directories to the Matlab search pathNow that we have our files organized, we have to tell Matlab to look in these directories for M files. We do this by adding directories to Matlab's search path. The path is a list of directories that Matlab examines to find Mfiles. We can see the current directories on the path by typing the command path:
path
Since we haven't added anything to the path yet, most of these directories will be Matlab's own function directories, which The MathWorks calls toolboxes.
To do this, let's specify our new function MATLAB directory in a variable. These will be character string variables, where the string is specified between single quotes (as in Lab 1.2):
On a Mac, write
mydirname = '/Volumes/yourusername/MATLAB/tools'
On a PC, write:
mydirname = '\\unethome.brandeis.edu\yourusername\MATLAB\tools'
To make sure we wrote it correctly, let's try to list the files in that directory using the command ls that we used in the first lab:
ls(mydirname)
Now we can generate our own list of directory paths to add to Matlab's path by using the command genpath:
p = genpath(mydirname)
Now we can add these directories to the path with the command addpath by typing
addpath(genpath(mydirname))
We can verify that Matlab can find our functions by using the which command (see help which).
which histbins
Q1: What is the output of which histbins?
We can get a list of our newly modified path directories with path:
path
Resolving path and file ambiguity with whichFile this note away for down the road: the command which is very useful for diagnosing issues with your path. If a function doesn't seem to be doing what you think it should be doing, try typing which functionname to make sure Matlab is grabbing the function from the directory it should be. This is particularly true if you begin to exchange files with friends; your friend could have the same filenames for some functions as you do. Matlab will grab the one that is "earliest" in its path but it won't give you an error that says "hey, there are 2 functions with the same name".
The startup.m fileIf you work on a computer consistently (for example, your computer at home), you may find it annoying to type in:
mydirname = '/Volumes/yourusername/MATLAB/tools' addpath(genpath(mydirname))
each time that you start up Matlab so that your files are added to the path. You can create a file called startup.m that is saved in your Documents/Matlab folder (for example, /Volumes/yourusername/Documents/Matlab on a Mac). This file will be run each time you start up Matlab. Simply add these commands to your startup.m file, and then your directories will be automatically added to the path when you start up. I will add a display call so that you will get a notice that the command is running.
mydirname = '/Volumes/yourusername/MATLAB/tools' disp(['Adding ' mydirname ' and all subdirectories to the path.']); addpath(genpath(mydirname)) clear mydirname; % no need to keep this variable around
Differences in cumulative histograms of 2 sets of samplesIn our next lecture, we will address how one can tell, statistically, if 2 distributions are different. Before we do that, we need to understand just a little bit more about math using sample distributions in Matlab.
One feature that has been shown to be of interest by mathematicians is the maximum difference (in absolute value) between the cumulative histograms of 2 samples. When you played the drugvsplacebo game, you probably picked up on the fact that the underlying distributions were likely to be different when the cumulative histograms differed greatly. We're talking about the following quantity that is denoted in black:
We'll look at this difference using real data found in the files change_mineral_breastfeeding.txt and change_mineral_other.txt. Breastfeeding women produce milk, which contains large amounts of calcium. This study wanted to examine if the process of breastfeeding reduced minerals that were present in the bones of women. To do this, the scientists measured the mineral content of the spinal cord in a group of breastfeeding women and other women over a time span of 3 months.
cd [your UNET home]/MATLAB/data
sample1 = load('change_mineral_breastfeeding.txt','ascii'); sample2 = load('change_mineral_other.txt','ascii'); [X1,Y1] = cumhist(sample1,[10 10],0.1); [X2,Y2] = cumhist(sample2,[10 10],0.1); figure(90); plot(X1,Y1,'bo'); hold on; plot(X2,Y2,'ro'); xlabel('Change in mineral content'); ylabel('Percent of samples'); title('Blue=breastfeeding women, Red=other women');
We want to calculate the maximum difference between these 2 cumulative histograms. To do this, we need to measure the distance between the distributions (along the y axis) at each location on the xaxis. However, we actually don't know the value of the function at each location of X. If you use the "zoom" tool to zoom in on the data points, you'll see that the values of X that comprise the plot are at different locations. In order to calculate the distance between these 2 curves, we need a common reference on the X axis. Creating cumulative histograms with a common XaxisThis is an example of a situation that arises frequently in realworld data analysis. We have the data to make the calculation that we'd like to make, but we must first make some transformation to get it into a form that we can work with. In this case, that transformation is to generate cumulative histograms such that the Xaxis points are in the same locations.
To create a common X axis, we will create a histogram, but we will choose bin edges such that they match the individual data points. Then, we will build up the cumulative histogram by adding up the number of points that fall into each bin. It may be easier to see this as an example:
To choose bin edges that match our data points, we will use the function sort to sort the data. We will also use the function unique to eliminate duplicate data points. Try it out:
sort(sample1) % now let's sort sample1 and sample2 together; % [sample1; sample2] makes a big list of the samples sort([sample1; sample2]) unique(sample1) unique([sample1; sample2])
Now we'll make a histogram using all of these values for the bins:
bin_edges = unique([sample1; sample2]); bin_counts1 = histc(sample1, bin_edges); bin_counts2 = histc(sample2, bin_edges);
Let's take a look at these results so far:
figure; bar(bin_edges, bin_counts1); bar(bin_edges, bin_counts2);We will next employ the function cumsum to provide the cumulative sum of the values in our bins. To see how this function works, try it out:
A = [ 1 2 3 4 5]; cumsum(A)
So, at each location, cumsum returns the cumulative sum of the elements in the vector up to that point. We can use this to make a cumulative histogram:
cumsum1 = cumsum(bin_counts1) / sum(bin_counts1); cumsum2 = cumsum(bin_counts2) / sum(bin_counts2);
We need to divide by the number of data points (the sum of the bin counts) so that the values are between 0 and 1; if we multiplied the above result by 100, then we would get values between 0 and 100%.
figure; plot(bin_edges,cumsum1,'b'); hold on; plot(bin_edges,cumsum2,'r'); xlabel('Sample values'); ylabel('Cumulative sum of samples'); Finding the value and location of the maximum difference between 2 curvesNow we have 2 curves with a common X axis. All that remains is to find the value of the maximum difference (in absolute terms) between these curves. We can do this in 3 small steps:  We can calculate the pointbypoint difference between these curves by subtracting 2 variables. For example:
A = [ 1 2 3 4 5]; B = [ 9 8 7 6 5]; BA,  We can use the function abs to calculate the absolute value of the result:
abs(1), abs(BA), abs(AB),  We can use the function max to find the value and the location of the largest point of a vector:
[maxvalueA,maxlocationA] = max(A) [maxvalueB,maxlocationB] = max(B) [maxdiff,maxdiff_location] = max(abs(BA))
Now let's put all of these steps together into an Mfile (you can copy and paste this into a new Mfile). Save this file in your functions/stats directory:
cumulative_hist_diff.mfunction [maxdiff, maxdiff_location, Xvalues, sample1CDF, sample2CDF] = cumulative_hist_diff(sample1,sample2) % CUMULATIVE_HIST_DIFF  Calculates the maximum difference b/w 2 samples % [MAXDIFF,MAXDIFF_LOCATION, XVALUES, SAMPLE1CDF, SAMPLE2CDF] = ... % CUMULATIVE_HIST_DIFF(SAMPLE1, SAMPLE2) % % Inputs: SAMPLE1  an array of sample data % SAMPLE2  an array of sample data % Outputs: MAXDIFF  The maximum (absolute) difference between the samples % MAXDIFF_LOCATION  The xaxis location of the maximum difference % XVALUES  The Xaxis values for the cumulative density functions % SAMPLE1CDF  The cumulative density function of SAMPLE1 % SAMPLE2CDF  The cumulative density function of SAMPLE2 bin_edges = [unique([sample1; sample2])]; bin_counts1 = histc(sample1, bin_edges); bin_counts2 = histc(sample2, bin_edges); Xvalues = bin_edges; sample1CDF = cumsum(bin_counts1) / sum(bin_counts1); sample2CDF = cumsum(bin_counts2) / sum(bin_counts2); [maxdiff, maxdiff_location] = max( abs(sample1CDFsample2CDF) );
Calculating the maximum difference between 2 cumulative histogramsNow let's use our Mfile to calculate the maximum difference between the 2 samples (in absolute value).
[maxdiff, maxdiff_location, Xvalues, sample1CDF, sample2CDF] = ... cumulative_hist_diff(sample1,sample2);
figure; plot(Xvalues,sample1CDF,'b'); hold on; plot(Xvalues,sample2CDF,'r'); xlabel('Change in mineral content'); ylabel('Percent of samples'); title('Blue=breastfeeding women, Red=other women');
Now let's add a bar showing the maximum difference at the proper location:
X_loc = [ Xvalues(maxdiff_location) Xvalues(maxdiff_location) ] Y_loc = [ sample1CDF(maxdiff_location) sample2CDF(maxdiff_location)] plot(X_loc, Y_loc,'k','linewidth',2);
We will use the difference between the 2 cumulative density functions in the next lab. Function referenceMatlab functions and operatorsUser functions
Copyright 20112012 Stephen D. Van Hooser, all rights reserved.

Updating...
Steve Van Hooser, Sep 8, 2011, 7:18 AM
Steve Van Hooser, Sep 8, 2011, 7:18 AM
Steve Van Hooser, Sep 8, 2011, 7:40 AM
displaydrugvsplacebo.m (3k) Steve Van Hooser, Sep 8, 2011, 7:34 AM
Steve Van Hooser, Sep 8, 2011, 7:34 AM
generate_random_data.m (1k) Steve Van Hooser, Sep 8, 2011, 7:34 AM
Steve Van Hooser, Sep 27, 2011, 8:51 AM
