Lab 2.4 Dealing with a truckload of data
Introduction
When one conducts a research study, one typically performs several experiments, and often stores the results in some organized way on the computer. Therefore, it often makes sense to design your analysis code so that it can traverse the directories where the data is stored, perform any analysis (such as fitting or computing index values), and then calculate statistics.
This style offers several benefits, including: 1) it forces consistent treatment of data from different experiments, 2) it allows you to modify or improve your analysis code and then repeat all of the calculations in an automated fashion, and 3) with a little extra code, you can include quality control procedures that monitor the quality of fits or index values for all of your individual experiments (more on this at the end of the lab).
In this lab we will use example data from an imaginary experiment that has some "real life" complexity. We will imagine that we are treating tumors in rats with a drug or placebo. Rats with liver tumors of various sizes are randomly assigned to a group, and the size of the tumor is monitoring with medical imaging technology over about 1 month. Let's imagine that we already have a well-tested function that extracts 2 quantities: 1) the percentage change in the size of the tumor over the observed period, and 2) the rate of growth/shrinkage of the tumor, as determined by an exponential fit. Let's not focus on the details of the function here so that we can instead focus on dealing with all of the data in our experiment. Download the function tumorfit.m, tumorplot.m and the example data tumor_example.txt. Put tumorfit.m and tumorplot.m into a a new package called +dasw/+tumor. Like before, you should add this directory to your search path using addpath(genpath('YOURPATH\MATLAB\+dasw\'));
help dasw.tumor.tumorfit % read the help
help dasw.tumor.tumorplot % read the help
tumor_data = load('tumor_example.txt','-ascii');
[change, rate,a,b,c,d] = dasw.tumor.tumorfit(tumor_data,5);
figure;
dasw.tumor.tumorplot(tumor_data,a,b,c,d);
Dealing with whole data sets, with results from individual experiments stored in different directories
Imagine that you have performed experiments on 30 rats in each group (that is, 30 in the placebo group, and 30 in the treatment group). The data files that you produce might be organized into directories like in the file tumor_study.zip (see list of files below). Download this file now and then 1) using the Macintosh Finder or the Windows file system, put tumor_study.zip in your data directory and 2) right-click on the file and choose "uncompress" or "unzip". You should see a new directory called tumor_study.
Let's first take a look at this information on disk. Double click on the tumor_study directory, and you'll see 2 directories inside, placebo and drug. Inside each of these directories, are experiments labeled exp001, exp002, etc. Inside these directories, are the raw data files called tumor_data.txt. The file tree, starting with tumor_study (and not including your data directory), looks like this:
tumor_study/
placebo/
exp001/
tumor_data.txt
exp002/
tumor_data.txt
...
exp030/
tumor_data.txt
drug/
exp001/
tumor_data.txt
exp002/
tumor_data.txt
...
exp030/
tumor_data.txt
Change the current directory so that you are in the tumor_study directory.
Let's look at the files here:
ls % remember this is the letter 'l' followed by s, not the # 1 followed by s
You should see a folder called placebo and another called drug.
We could explore the files that are present here by using the cd command to change the current directory into all of the subdirectories. There's nothing wrong with this, but given all of the directories that we have here, it might be easier to keep the present working directory where it is, and refer to files and directories by their full paths or partial paths like the following:
A partial path refers to a file or its directory relative to the current directory, although the file need not reside in the current directory. For example:
ls placebo
(On a Mac):
ls placebo/exp001 % refers to exp001 relative to the current directory
ls placebo/exp001/tumor_data.txt
td = load('placebo/exp001/tumor_data.txt','-ascii');
(On a PC)
ls placebo\exp001
ls placebo\exp001\tumor_data.txt
td = load('placebo\exp001\tumor_data.txt','-ascii');
Often when we are using Matlab, we may not want to assume that the user is currently in any particular directory; for this reason it is often useful to specify a file by its full path. On my computer (a Mac), my tumor_study folder is located at
[userpath filesep 'data' filesep 'tumor_study']
And I can list these files using the full path
ls [userpath filesep 'data' filesep 'tumor_study' filesep 'placebo']
ls ls [userpath filesep 'data' filesep 'tumor_study' filesep 'placebo' filesep 'exp001']
said this approach was easier than changing the current directory, but you may be wondering how this is possible. Look at all the text you had to type. We've typed the text to understand how these paths work, but the key is to have Matlab do the work for you. This is what we'll turn to next.
String variables
Up to now, when we have specified a file, we have always done this by typing in the filename into the command line directly. It is also possible to store the name of a file in a string variable.
myfilename = 'placebo/exp001/tumor_data.txt'
td2 = load(myfilename,'-ascii');
Here we loaded the file by passing a string variable rather than typing in the string manually. Let's focus on strings for a few minutes.
At their base level, string variables are lists of numbers, but we are instructing Matlab to interpret these numbers as text. The basic type of string variable in Matlab follows the ASCII (American Standard Code for Information Interchange) code. It's a little like those secret decoder rings you got in your breakfast cereal when you were a kid, there are numbers that represent each text symbol. It's not important to know the actual code, but it will help you understand how string variables work to know that there is an underlying code.
Let's explore this numeric code a little bit. We specify a variable as a string by using single quotes. Everything inside the single quotes gets put into the string.
myS = 'This is a test'
We can decode the string by converting it to a 'double', which is the data type for real valued numbers:
double(myS)
We can also do the reverse, by specifying a list of numbers and converting it to a string; the name of the string type is char, which stands for character.
S = [ 65 66 67 68 69]
char(S)
Q1: What is the string representation of S? By examining the numeric code for myS, what is the numeric code (that is, the ASCII code) for a space?
We can perform many of the same operations on strings as we do other variables.
str1 = 'Hello'
str2 = 'World'
We can concatenate 2 strings using the same notation we use for concatenating numbers:
str3 = [ str1 str2]
str4 = [ str1 ' ' str2]
str5 = [ 'This is a test.' ' ' 'Yes, it is.']
Compare this to numeric concatenation:
a = [ 5 ]
b = [ 6 ]
c = [ a b ]
There are a number of nice functions that are specifically made for strings that do things like compare strings (strcmp), convert to upper (upper) and lower (lower) case, etc. See help strings and help strfun. Let's play with strcmp briefly:
strcmp('hello','Hello')
strcmp('Hello','Hello')
strcmp('Hi','Hello')
We can also use NOT operator, ~, to flip the logic here. For example:
~strcmp('Hi','Hello')
returns true if 'Hi' is different from 'Hello'.
Strings and file paths
When it comes to dealing with files that are embedded deep in directories, strings and string functions can come to our aid to save us some work.
Change your current directory so you are in the tumor_study directory (if you're not there already). Then let's store the current directory in a string variable:
mypwd = pwd
mydir = 'placebo'
myexp = 'exp001'
dfile = 'tumor_data.txt';
Suppose we wanted to load this file. We know on a Mac we'd have to use the file separator '/', or on a PC we'd have to use the file separator '\' (on Unix we'd use '/', on a Mac OS 9 machine we'd use ':'). We'd like our code to work on any computer, so there is another special command called filesep that returns the string for the file separator regardless of what computer is being used:
filesep
So we can write this on any computer:
myfilename2 = [mypwd filesep mydir filesep myexp filesep dfile]
td3 = load(myfilename2,'-ascii');
As a shorthand, one can also use the command fullfile, which simply concatenates directory names putting a filesep in between (see help fullfile):
myfilename3 = fullfile(mypwd, mydir, myexp, dfile)
Examining files and the directory structure
Importantly, we can also use Matlab to examine the files and directories that are present so that we can write code that will automatically perform analysis. The function that examines the directory tree is called dir (see help dir):
Let's change directories so that you are in the data directory above the tumor_study directory:
myDir = dir('tumor_study')
On my computer, I get the following output:
myDir =
4x1 struct array with fields:
name
date
bytes
isdir
datenum
This is a new type of variable that we haven't seen before, called a structure. Each element of a structure array has multiple fields that have names and values.
Let's look at the first value of the array:
myDir(1)
You can see the directory has a name '.', a date, which corresponds to when the file was modified, the size of the file (bytes), a 0/1 value indicating whether or not the file is a directory, and the modified time in Matlab's date format which is called datenum (not important).
One can access the fields of a structure using a period (not to be confused with the element-by-element multiplication operator):
myDir(1).name
You may be asking yourself, why is there a directory named '.'? You don't see it in Windows or Mac OS, so what is it doing there? The reason is that it is a feature of the unix operating system that Matlab has implemented. The name '.' refers to the current directory, and the name '..' refers to the previous directory or "parent" directory. Check it out:
ls
ls .
ls ..
The upshot for us is that there are 2 filenames that are almost never of interest to us. So we can ignore them, as we do in the function below (in the section "Putting it all together").
Let's look at some other files:
myDir(3)
myDir(4)
Q2: Is this file a directory?
We can use this directory information to examine the subdirectories. Find the index of myDir that corresponds to the 'drug' or 'placebo' directory (it might be 3, or 4, or 5), and try that:
myindex = 4;
subdir = dir(['tumor_study' filesep myDir(4).name])
Let's look at these files:
subdir(3)
subdir(4)
Performing an action on each subdirectory
Now we need a means of traversing a subdirectory tree to examine all of the files. We can do this with a for loop.
For loops
Up to now, when we wanted to perform an action over and over, we used a while loop. It is true that the while loop is the only loop one ever truly needs. But, if you know in advance the number of times you want to repeat a task, another loop, called the for loop, is a little bit easier. Let's compare and contrast.
Here is the while loop that you already know:
i = 1;
while i<5,
i,
i = i + 1;
end
The above while loop prints the value of i 5 times. We can do the same with a for loop:
for A=[1 2 3 4 5],
A,
end;
Here we have specified the values that A will take ahead of time. It is a little shorter to write (2 lines shorter). We have a little less flexibility than with a while loop; for example, we can't change the value of A in a way that will propagate through the loop, like this:
for A=[1 2 3 4 5],
A,
A = A+2;
end;
Setting the variable at the end of the loop has no effect, because A is set to the next value in the array [1 2 3 4 5] each time the program enters the beginning of the loop.
If you don't need the flexibility of the while loop, and you know how many steps you want to accomplish, then I recommend using a for loop. It is a couple fewer lines, and it also leaves less room for screwing up if you edit your looping variable (the i in the while loop example above) incorrectly.
Let's explore a few more common forms of the for loop:
a = 1:5,
for A=a,
A,
end;
for A=1:5, % no need to create a separate variable just for the loop
A,
end;
Very commonly, we will want to loop through an array of a fixed length, but whose length is not known when we are writing the code (or might vary from run to run). We can do that by using the length function to determine the number of loops we will run:
a=1:5,
for A=1:length(a),
a(A), % each time through the loop we will print the Ath entry of a
end;
Now let's loop through all the subdirectories and print their names:
for i=1:length(subdir),
['The name of the directory is ' subdir(i).name '.'],
end;
Putting it all together
Now we can put all that we've learned into a single function that will crawl through the data and analyze our experiment. Enter this function line by line so you can understand all of the steps.
analyze_tumors.m
function t = analyze_tumors(folder, conditionName)
% ANALYZE_TUMORS - Analyze folder of data
% T = dasw.tumor.analyze_tumors(FOLDERNAME, CONDITIONNAME)
%
% Steps through all subdirectories of the folder
% FOLDERNAME and looks for files called
% 'tumor_data.txt'.
%
% If it finds one, then it performs a fit
% using dasw.tumor.tumorfit.m.
%
% The function then builds a table T of all such entries with
% The CHANGE, RATE, and A, B, C, D parameters returned. There
% is a column "condition" that is set to CONDITIONNAME.
%
subdirs = dir(folder);
t = []; % start with an empty variable
for i=1:length(subdirs),
if subdirs(i).isdir&~strcmp(subdirs(i).name,'.')&~strcmp(subdirs(i).name,'..'),
filename = [folder filesep subdirs(i).name filesep 'tumor_data.txt'];
if exist(filename),
disp(['Analyzing file ' filename '.']);
data = load(filename,'-ascii');
[change_,rate_,a_,b_,c_,d_] = dasw.tumor.tumorfit(data,5);
t_here = array2table([change_ rate_ a_ b_ c_ d_],...
'variableNames',["Change" "Rate" "a" "b" "c" "d"]);
t_condition = array2table(string(conditionName),...
'variableNames',"Condition");
t_filename = array2table(string(filename),...
'VariableNames',"File");
t = [t ; t_here t_condition t_filename];
end;
end;
end;
Assuming you are in the directory data that is one level above the tumor_study directory, you can use:
t_drug = dasw.tumor.analyze_tumors('tumor_study/drug','Drug');
t_placebo = dasw.tumor.analyze_tumors('tumor_study/placebo','Placebo');
t = [t_drug ; t_placebo] % combine into one table and look at it
Q3: Generate a figure that plots a cumulative histogram of the parameter change for the drug and placebo conditions. What statistical test should you use? Do you think a t-test or a kruskal-wallis test makes the most sense?
Verifying that index values / fits are good
So how do we know that our fits and index value extraction are doing a good job for such a big dataset? Ideally, we'd like to have a brief method for plotting out all of our fits and index values so we can take a look at the whole set. Doing this manually would take a long time, so let's write some code to do the work for us.
The first piece of code that will help is a utility function that I wrote for creating and addressing axes across multiple figures. If we were to try to plot our entire data set on a single figure, the resulting plots would be so small as to be uninterpretable. On the other hand, if we pulled up a new figure window for each plot, we'd soon be overwhelmed with windows. So ideally we would like a tool that allows us to pull up plots across multiple panels of multiple figures, like supersubplot.m. (I recommend putting this function in your +dasw/+plot package.)
supersubplot.m
function ax_h = supersubplot(fig, m, n, p)
% SUPERSUBPLOT - Organize axes across multiple figures
%
% AX_H=dasw.plot.supersubplot(FIG, M, N, P)
%
% Returns a set of axes that can be arranged across multiple
% figures.
%
% Inputs:
% FIG - figure number of the first figure in series
% M - The number of rows of plots in each figure
% N - the number of columns of plots in each figure
% P - the plot number to use, where the numbers go
% from left to right, top to bottom, and then
% continue across figures.
%
% Outputs: AX_H - the axes handle
%
% Example:
% fig = figure;
% ax = supersubplot(fig,3,3,12);
% % will return a subplot in position 3
% % on a second figure
%
% See also: SUBPLOT, AXES
if isempty(fig), fig = figure; end; % make it if it's an empty variable
if ~ishandle(fig), fig = figure; end; % make it if it's not a good figure
ud = get(fig,'userdata'); % read the userdata field, which is for our use
if isempty(ud), ud = fig; end;
figure_number = ceil(p/(m*n)); % round up to nearest integer
if length(ud)<figure_number,
ud(figure_number) = figure;
else,
figure(ud(figure_number));
end;
plotnumber = p - m*n*(figure_number-1);
set(fig,'userdata',ud);
ax_h = subplot(m,n,plotnumber);
Now we can add a couple of lines to our analyze_tumor.m function that will perform the plotting.
analyze_tumor_plot.m
function [t] = analyze_tumors_plot(folder, conditionName, plotit)
% ANALYZE_TUMORS - Analyze folder of data
% T = dasw.tumor.analyze_tumors_plot(FOLDERNAME, CONDITIONNAME, PLOTIT)
%
% Steps through all subdirectories of the folder
% FOLDERNAME and looks for files called
% 'tumor_data.txt'.
%
% If it finds one, then it performs a fit
% using dasw.tumor.tumorfit.m.
%
% The function then builds a table T of all such entries with
% The CHANGE, RATE, and A, B, C, D parameters returned. There
% is a column "condition" that is set to CONDITIONNAME.
%
% If PLOTIT is 1, then the function also plots each fit.
%
subdirs = dir(folder);
t = []; % empty variable for table
if plotit, fig = figure; else, fig = []; end;
plot_number = 1;
for i=1:length(subdirs),
if subdirs(i).isdir&~strcmp(subdirs(i).name,'.')&~strcmp(subdirs(i).name,'..'),
filename = [folder filesep subdirs(i).name filesep 'tumor_data.txt'];
if exist(filename),
disp(['Analyzing file ' filename '.']);
data = load(filename,'-ascii');
[change_,rate_,a_,b_,c_,d_] = dasw.tumor.tumorfit(data,5);
if plotit,
dasw.plot.supersubplot(fig, 3, 3, plot_number);
dasw.tumor.tumorplot(data, a_, b_, c_, d_);
title([subdirs(i).name ', C%=' int2str(change_) ', R=' num2str(rate_,2)]);
plot_number = plot_number + 1;
end;
t_here = array2table([change_ rate_ a_ b_ c_ d_],...
'variableNames',["Change" "Rate" "a" "b" "c" "d"]);
t_condition = array2table(string(conditionName),...
'variableNames',"Condition");
t_filename = array2table(string(filename),...
'VariableNames',"File");
t = [t ; t_here t_condition t_filename];
end;
end;
end;
[t_drug]=dasw.tumor.analyze_tumors_plot('tumor_study/drug','Drug',1);
[t_placebo]=dasw.tumor.analyze_tumors_plot('tumor_study/placebo','Placebo',1);
t = [ t_drug; t_placebo];
Q4: Does the change index value do a good job of capturing the change in tumor sizes for each and every bit of data? Does the rate index, which is based on the fits, do a good job? If you wanted to describe the efficacy of your drug in an academic journal article, which index value would you choose to publish?
Function reference
Matlab functions and operators
User functions
dasw.tumor.tumorfit.m
dasw.tumor.tumorplot.m
dasw.tumor.analyze_tumors.m
dasw.plot.supersubplot.m
dasw.tumor.analyze_tumors_plot.m
Labs are copyright 2011-2012, Stephen D. Van Hooser, all rights reserved.