Lab 1.4 M-file Management; Differences in cumulative histograms
Introduction
In this lab, we will focus on organizing our workspace and files so that we can work efficiently.
In the second half of the lab, we will turn our attention back to samples and distributions, and specifically how we can perform mathematical operations on samples.
Efficient M-file management
In the previous labs and in the homework, we created several M-files, including some scripts and some functions. In all of these labs, we have simply changed the current directory so that it corresponded to the directory where our M-files were located. However, imagine if one kept adding M-files to the same directory. After a few months of work, the directory would be full of dozens of M-files that may or may not be related to one another. So it is helpful to organize the files and to tell Matlab to look in more than one directory to find its M-files.
There is no one right or wrong way to curate your Matlab files, but learning to do it in some way will dramatically improve your efficiency as the number of projects (and labs and homeworks) you are working on increases. We will spend some time manually organizing our files to see how this is done.
A suggested file organization scheme
In the past few years, the vast majority of students have completed their work on their own laptop. We will manage these files on your own computer.
Find the MATLAB folder on your computer. On Macs, the MATLAB folder is almost always located at /Users/YOURUSERNAME/Documents/MATLAB , where YOURUSERNAME is your user name. On PCs, the folder is usually located at C:\Users\YOURUSERNAME\Documents\MATLAB .
You can find this location on the Matlab command line as well using the command userpath:
userpath
Now let's create a directory called MATLAB in your UNET folder (using either Windows or the Finder on a Mac), and a subdirectory called tools.
Using either the Macintosh Finder or Windows, make 2 new directories within the MATLAB folder, 1 called data, 1 called +dasw. Putting a + in front of the directory name makes the directory a package, which allows you to organizer your files by topic. Within the +dasw directory, make a new folder called +plot, a new folder called +stats. In addition, in the Matlab folder, create a new folder called scripts. Within the scripts directory, make a new folder called Labs, and another called Homework.
The files that we have created can be downloaded by clicking on the links in the directory tree below. Download these files, and put them in the appropriate folders. (Note: the files change_mineral_...txt are data files we will examine today.)
Now your directory structure should look something like this:
[your documents directory]/MATLAB/tools/
data/
change_mineral_breastfeeding.txt
change_mineral_other.txt
scripts/
Labs/
Homework/
+dasw/
+plot/
cumhist.m
displaydrugvsplacebo.m
histbins.m
+stats/
drugvsplacebo.m
generate_random_data.m
Adding directories to the Matlab search path
Now that we have our files organized, we have to tell Matlab to look in these directories for M files. We do this by adding directories to Matlab's search path. The path is a list of directories that Matlab examines to find M-files. We can see the current directories on the path by typing the command path:
path
Since we haven't added anything to the path yet, most of these directories will be Matlab's own function directories, which The MathWorks calls toolboxes.
To do this, let's specify our new MATLAB directory in a variable. These will be character string variables, where the string is specified between single quotes (as in Lab 1.2):
Write
mydirname = [ userpath filesep 'tools']
To make sure we wrote it correctly, let's try to list the files in that directory using the command ls that we used in the first lab:
ls(mydirname)
Now we can generate our own list of directory paths to add to Matlab's path by using the command genpath:
p = genpath(mydirname)
Now we can add these directories to the path with the command addpath by typing
addpath(genpath(mydirname))
The functions in packages are referred to by a name that includes the package name(s). For example, now that histbins is located in the package plot which is inside the package dasw, it must be referred to by dasw.plot.histbins.
We can verify that Matlab can find our functions by using the which command (see help which).
which dasw.plot.histbins
Q1: What is the output of which dasw.plot.histbins?
We can get a list of our newly modified path directories with path:
path
Resolving path and file ambiguity with which
File this note away for down the road: the command which is very useful for diagnosing issues with your path. If a function doesn't seem to be doing what you think it should be doing, try typing which functionname to make sure Matlab is grabbing the function from the directory it should be. This is particularly true if you begin to exchange files with friends; your friend could have the same filenames for some functions as you do. Matlab will grab the one that is "earliest" in its path but it won't give you an error that says "hey, there are 2 functions with the same name".
Packages go a long way to solving function name ambiguity!
The startup.m file
If you work on a computer consistently (for example, your laptop), you may find it annoying to type in:
mydirname = [ userpath filesep 'tools'];
addpath(genpath(mydirname))
each time that you start up Matlab so that your files are added to the path. You can create a file called startup.m that is saved in your userpath folder. This file will be run each time you start up Matlab. Simply add these commands to your startup.m file, and then your directories will be automatically added to the path when you start up. I will add a display call so that you will get a notice that the command is running.
mydirname = [ userpath filesep 'tools'];
disp(['Adding ' mydirname ' and all subdirectories to the path.']);
addpath(genpath(mydirname))
clear mydirname; % no need to keep this variable around
Differences in cumulative histograms of 2 sets of samples
In our next lecture, we will address how one can tell, statistically, if 2 distributions are different. Before we do that, we need to understand just a little bit more about math using sample distributions in Matlab.
One feature that has been shown to be of interest by mathematicians is the maximum difference (in absolute value) between the cumulative histograms of 2 samples. When you played the drugvsplacebo game, you probably picked up on the fact that the underlying distributions were likely to be different when the cumulative histograms differed greatly. We're talking about the following quantity that is denoted in black:
We'll look at this difference using real data found in the files change_mineral_breastfeeding.txt and change_mineral_other.txt. Breastfeeding women produce milk, which contains large amounts of calcium. This study wanted to examine if the process of breastfeeding reduced minerals that were present in the bones of women. To do this, the scientists measured the mineral content of the spinal cord in a group of breastfeeding women and other women over a time span of 3 months.
cd ([userpath filesep 'tools' filesep 'data'])
sample1 = load('change_mineral_breastfeeding.txt','-ascii');
sample2 = load('change_mineral_other.txt','-ascii');
[X1,Y1] = dasw.plot.cumhist(sample1);
[X2,Y2] = dasw.plot.cumhist(sample2);
figure(90);
plot(X1,Y1,'bo-');
hold on;
plot(X2,Y2,'ro-');
axis([-10 10 0 100]);
xlabel('Change in mineral content');
ylabel('Percent of samples');
title('Blue=breastfeeding women, Red=other women');
We want to calculate the maximum difference between these 2 cumulative histograms. To do this, we need to measure the distance between the distributions (along the y axis) at each location on the x-axis. However, we actually don't know the value of the function at each location of X. If you use the "zoom" tool to zoom in on the data points, you'll see that the values of X that comprise the plot are at different locations. In order to calculate the distance between these 2 curves, we need a common reference on the X axis.
Creating cumulative histograms with a common X-axis
This is an example of a situation that arises frequently in real-world data analysis. We have the data to make the calculation that we'd like to make, but we must first make some transformation to get it into a form that we can work with. In this case, that transformation is to generate cumulative histograms such that the X-axis points are in the same locations.
To create a common X axis, we will create a histogram, but we will choose bin edges such that they match the individual data points. Then, we will build up the cumulative histogram by adding up the number of points that fall into each bin. It may be easier to see this as an example:
To choose bin edges that match our data points, we will use the function sort to sort the data. We will also use the function unique to eliminate duplicate data points. Try it out:
sort(sample1)
% now let's sort sample1 and sample2 together;
% [sample1; sample2] makes a big list of the samples
sort([sample1; sample2])
unique(sample1)
unique([sample1; sample2])
Now we'll make a histogram using all of these values for the bins:
bin_edges = unique([sample1; sample2]);
bin_counts1 = histc(sample1, bin_edges);
bin_counts2 = histc(sample2, bin_edges);
Let's take a look at these results so far:
figure;
bar(bin_edges, bin_counts1);
title('Histogram 1');
figure;
bar(bin_edges, bin_counts2);
title('Histogram 2');
We will next employ the function cumsum to provide the cumulative sum of the values in our bins. To see how this function works, try it out:
A = [ 1 2 3 4 5];
cumsum(A)
So, at each location, cumsum returns the cumulative sum of the elements in the vector up to that point. We can use this to make a cumulative histogram:
cumsum1 = cumsum(bin_counts1) / sum(bin_counts1);
cumsum2 = cumsum(bin_counts2) / sum(bin_counts2);
We need to divide by the number of data points (the sum of the bin counts) so that the values are between 0 and 1; if we multiplied the above result by 100, then we would get values between 0 and 100%.
figure;
plot(bin_edges,cumsum1,'b-');
hold on;
plot(bin_edges,cumsum2,'r-');
xlabel('Sample values');
ylabel('Cumulative sum of samples');
Finding the value and location of the maximum difference between 2 curves
Now we have 2 curves with a common X axis. All that remains is to find the value of the maximum difference (in absolute terms) between these curves. We can do this in 3 small steps:
We can calculate the point-by-point difference between these curves by subtracting 2 variables. For example:
A = [ 1 2 3 4 5];
B = [ 9 8 7 6 5];
B-A,We can use the function abs to calculate the absolute value of the result:
abs(-1),
abs(B-A),
abs(A-B),We can use the function max to find the value and the location of the largest point of a vector:
[maxvalueA,maxlocationA] = max(A)
[maxvalueB,maxlocationB] = max(B)
[maxdiff,maxdiff_location] = max(abs(B-A))
Now let's put all of these steps together into an M-file (you can copy and paste this into a new M-file). Save this file in your functions/+stats directory:
cumulative_hist_diff.m
function [maxdiff, maxdiff_location, Xvalues, sample1CDF, sample2CDF] = cumulative_hist_diff(sample1,sample2)
% CUMULATIVE_HIST_DIFF - Calculates the maximum difference b/w 2 samples
% [MAXDIFF,MAXDIFF_LOCATION, XVALUES, SAMPLE1CDF, SAMPLE2CDF] = ...
% dasw.stats.cumulative_hist_diff(SAMPLE1, SAMPLE2)
%
% Inputs: SAMPLE1 - an array of sample data
% SAMPLE2 - an array of sample data
% Outputs: MAXDIFF - The maximum (absolute) difference between the samples
% MAXDIFF_LOCATION - The x-axis location of the maximum difference
% XVALUES - The X-axis values for the cumulative density functions
% SAMPLE1CDF - The cumulative density function of SAMPLE1
% SAMPLE2CDF - The cumulative density function of SAMPLE2
bin_edges = [unique([sample1; sample2])];
bin_counts1 = histc(sample1, bin_edges);
bin_counts2 = histc(sample2, bin_edges);
Xvalues = bin_edges;
sample1CDF = cumsum(bin_counts1) / sum(bin_counts1);
sample2CDF = cumsum(bin_counts2) / sum(bin_counts2);
[maxdiff, maxdiff_location] = max( abs(sample1CDF-sample2CDF) );
Calculating the maximum difference between 2 cumulative histograms
Now let's use our M-file to calculate the maximum difference between the 2 samples (in absolute value).
[maxdiff, maxdiff_location, Xvalues, sample1CDF, sample2CDF] = ...
dasw.stats.cumulative_hist_diff(sample1,sample2);
figure;
plot(Xvalues,sample1CDF,'b-');
hold on;
plot(Xvalues,sample2CDF,'r-');
xlabel('Change in mineral content');
ylabel('Percent of samples');
title('Blue=breastfeeding women, Red=other women');
Now let's add a bar showing the maximum difference at the proper location:
X_loc = [ Xvalues(maxdiff_location) Xvalues(maxdiff_location) ]
Y_loc = [ sample1CDF(maxdiff_location) sample2CDF(maxdiff_location)]
plot(X_loc, Y_loc,'k-','linewidth',2);
We will use the difference between the 2 cumulative density functions in the next lab.
Function reference
Matlab functions and operators
User functions
dasw.stats.cumulative_hist_diff.m
Copyright 2011-2012 Stephen D. Van Hooser, all rights reserved.