Lab 1.1 Getting started: Matlab and Data Analysis/Stats basics

Before you start: how to get the most out of the labs

Some of you may be doing these labs for credit in the course, others may be doing them to learn how to use Matlab on your own or as you are working in a lab. To get the most out of the labs:

  • Type in the example code yourself, rather than cutting and pasting. If you type it out yourself, you'll start to develop familiarity with the language.

  • Make sure you go slowly enough for the material to sink in. We all get used to skimming web pages, but you'll get more out of the tutorial if you can slow yourself down and think as you read.

  • Even if you are not taking the course for credit, answer the questions as you go. This process will help to ensure that you are thinking about the material as you go.

What is Matlab, and why are we going to use it in this class?

Matlab, made by The MathWorks, is a programming environment for working with numerical data and, to a lesser extent, symbolic equations. Matlab is an interpreted language, which means that each command is executed as soon as you enter it, with no separate compile step. For this reason, it is an easy environment in which to perform a few manipulations on some data and plot the output without having to include a lot of the basic declarations required by more traditional programming languages. In addition, Matlab contains a vast array of built-in functions for performing manipulations on data.

Why not just use some program with all pre-canned statistical tests, like Excel or SPSS or <insert your favorite stats package here>?

Spreadsheets and dedicated programs are decent means of performing statistical analyses, and probably the most common. If the data types you are examining are very simple, and if some device or hardware that you are using to acquire your data can provide the data in this simple format, then you may be able to get away with using such a package. Indeed, after the course, many of you may choose to proceed with such a package. But there are at least 4 good reasons to learn a computing language as opposed to only using packages:

  1. To understand what is going on behind-the-scenes in these built-in packages, so you use them properly (and reduce the likelihood of generating erroneous conclusions)

  2. To develop custom analyses when your question isn't addressed by a common, pre-canned test

  3. If necessary, to process or transform your data so it can be addressed by a common test (transformations or processing are often required in the sciences, especially biology, neuroscience, physics, economics)

  4. To automate your workflow so you can spend more of your time focused on the meaning of your data rather than on the process of analyzing the data

Starting Matlab

To start Matlab, find the Matlab application on your computer. (Don't have it? If you are at Brandeis you can install it with our site license, see here.)

  • On a Mac OS X machine, it is likely in the Applications folder in the Dock along the bottom of the screen. Click the Matlab application.

  • On a Windows XP machine, click the "Start" button along the bottom of the screen, choose "Programs", and find and select the Matlab application.

When you start Matlab, you are presented with one or more windows. The most important window is the Command Window, which contains the command line:

The command line

The command line is the primary way that one communicates with Matlab. Although most of you probably grew up controlling your computer with the mouse and mainly using the keyboard to type text, command lines were the most common way of interacting with computers 15-30 years ago.

So why is it good that we are entering commands like people did 15-30 years ago? The reason is that a command line interface gives the user the flexibility of a real language to tell the computer exactly what they want it to do. Rather than clicking on a few buttons or menu options that the programmer thought were most useful, you can specify a variety of different types of commands and loops and iterations using an actual language. (Okay, there are "graphical" programming languages, but in my opinion it takes a really long time to express one's code in the graphical languages I have seen.)

There are many ways to learn a new language. One approach is to study the grammar and some vocabulary before illustrating with examples. Here, we'll do the opposite: we'll start immediately with a few examples to get your feet wet, and build up to the grammar slowly over a few weeks.

So let's jump right in and type a little bit on the command line. The command line is inside the Command Window and is marked by a prompt that looks like 2 right angle brackets: >>. Text that is literally in the Matlab language, that you should typically write on the command line, is indicated by the Courier font. Go to the command line and type:

a = 5

You've just created a variable called a and set it equal to 5. If you now type

a*5

Matlab prints the answer, 25, and stores it in a default variable called ans.

Now type

b = a*5;

You've just made a new variable b, and set it equal to the product of the variable a and 5. By adding a semi-colon, you've suppressed the output from being printed. Suppressing the output will be important later.
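As a small sketch of what the semicolon does and does not do: it only hides the printout, and the assignment still happens. The variable names c and d below are made up for illustration and are not used elsewhere in the lab.

```matlab
% Illustrative only: c and d are hypothetical variable names.
c = 7;        % semicolon: c is created, but nothing is printed
d = c * 3;    % d is computed silently as well
d             % no semicolon: Matlab prints the value of d, which is 21
```

Typing a variable's name by itself, as on the last line, is a quick way to check a value that was assigned silently.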

In Matlab, numeric variables are matrices, and by default their elements are double-precision floating-point numbers (type "double" for the programmers out there). This makes it easy to perform manipulations on groups of related numbers at the same time. You did not realize it, but the variables you created above are 1x1 matrices. Let's see how this works. Type

a = [ 1 2 3 4 5]

b = power(a,2)

You've just redefined the variable a to be the array (a 1x5 matrix) [1 2 3 4 5], and set b to be equal to the square of each of its elements. You can access values of a and b by using parentheses. Type

a(3)

b(3)

a(4)

b(4)

and you see the values. You can also use b(end) or a(end) to see the last element.
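Here is a short sketch of a few more ways to inspect the same arrays. The functions size and isequal, the .^ operator, and the a(2:4) range syntax are standard Matlab features that this lab does not otherwise introduce; the name b2 is made up for illustration.

```matlab
a = [1 2 3 4 5];    % same array as above
b = power(a, 2);    % element-by-element square: [1 4 9 16 25]
b2 = a .^ 2;        % the .^ operator is another way to write power(a,2)
size(a)             % reports 1 row and 5 columns
a(end)              % last element of a, which is 5
b(end)              % last element of b, which is 25
a(2:4)              % a range of elements: [2 3 4]
isequal(b, b2)      % 1 (true): the two ways of squaring agree
```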

Commands and the file path

Matlab, like all computer languages, has a variety of built-in commands and operators. But its power comes from the ability of the user to add new commands. We will revisit this in considerably more detail later. For now, I want to discuss the minimum amount of information that will allow you to download some programs that I have written and to run them.

When you look at the Command Window, you might notice that there is a directory at the top of the window (in my case '/Users/vanhoosr/Documents/MATLAB', but it will vary from computer to computer). This is the current working directory; the working directory is the location on the computer's disk where Matlab will look first if the user types a command or runs a command that tries to load or write to a file.

The current working directory is listed above the command line as I mentioned, but you can also enter a command on the command line to report its value (pwd stands for "print (current) working directory"):

pwd

will display the current directory. You can list the files in this directory with the command ls (this is the letter 'l', not the number 1).

ls

Let's add some files to our file path. At the bottom of this document (in the section called Attachments), there is a file called Lab1_1.zip. Download this file to your disk (make sure you click the downward arrow to the right of the file name to download it; clicking on the name below will merely show you the contents on-line), and double-click it to extract the contents. It should create a directory/folder called Lab1_1. Now, drag this folder Lab1_1 into the Current Folder window that is normally to the left of the Command Window. You should now be able to see it in the Current Folder window, and be able to see it if you list the files that are present:

ls

(you should see, among any other files, Lab1_1)

Now let's change the current working directory to Lab1_1 so that we can run the files within.

cd Lab1_1

Now when you list the files, you should see the programs that we will run for the rest of the class.

ls

You should see displaydrugvsplacebo.m, drugvsplacebo.m, generate_random_data.m.
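The directory commands can also be combined, as in this minimal sketch of a round trip. It uses the built-in function tempdir (which returns the system's temporary folder) purely as a stand-in destination so that the example works on any machine; the variable name startdir is made up for illustration.

```matlab
startdir = pwd;     % pwd can hand back its value; remember where we are
cd(tempdir);        % move somewhere else (tempdir is only a stand-in here)
pwd                 % now reports the temporary folder
cd(startdir);       % go back to where we started
pwd                 % same directory as at the beginning
```

Typing cd .. moves up one folder, which is another way to leave Lab1_1 if you need to.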

Data analysis warm-up

For the rest of the lab, we're going to turn our attention to a game that builds intuition for performing data analysis. The game mimics a clinical trial for a drug to treat a fatal disease. The imagined trial lasts 20 years; some of the patients take a placebo that looks and smells just like the real drug, while others get the actual drug. Neither the doctor nor the patient knows what the patient is taking (only the statisticians do!). At the conclusion of each experiment, the patients are monitored to see how long each lives relative to the end of the study. (Negative numbers indicate the patient died during the study.)

We will actually spend several class periods studying the results of experiments and games like these. We will learn how to analyze this data both intuitively/graphically and with statistics.

Lists of numbers

In the game you will see the results of trials with a drug or a placebo. At the end of the results, you'll be asked to pick whether you would take the drug or decline (or take drug A vs. drug B). You'll then learn how your decision fared, as well as how 100 people who made the same choice fared. You'll play several times.

Q1: Keep track of your "score" for 5-10 games. By score, I mean 1) whether you were better off with your choice, and 2) how many out of 100 people were better off. Write down how you fared, and how 100 additional participants fared. This bold Q# symbol in each lab indicates something that you should write down to turn in at the end of the class. Click here to read how to prepare a Word document to turn in by email at the end of class.

To play, type

drugvsplacebo

(Did you get an error that says undefined function icdf? This means your installation is missing the Statistics Toolbox. Please check your Matlab installation again.)

Summaries of lists of numbers: Mean, Median, Percentile range

Now instead of seeing all of the data, you'll see summaries of the data. Q1 (continued): Again, keep track of your scores. Type

drugvsplacebo('mean')

drugvsplacebo('median')

drugvsplacebo('percentilerange')

Q2: Would you rather look at the whole distribution, the mean, the median, or the percentile range to make your decision?
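To make these summaries concrete, here is a sketch that computes them for a short, made-up list of survival times (the numbers are invented for illustration and are not data from the game). mean and median are core Matlab functions; prctile is part of the Statistics Toolbox. Which percentiles the game reports for 'percentilerange' is not stated here, so the 25th and 75th below are an assumption.

```matlab
% Made-up survival times in years (illustrative only, not from the game)
x = [-3 1 2 4 6 8 40];
m = mean(x)        % arithmetic average, sum(x)/numel(x), about 8.29
md = median(x)     % middle value of the sorted list, which is 4
% Percentile range: the 25th and 75th percentiles are an assumed choice.
% prctile requires the Statistics Toolbox.
p = prctile(x, [25 75])
```

Notice that the single large value (40) pulls the mean well above the median. This is one reason the different summaries may lead you to different decisions in the game.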

Graphical summaries of lists of numbers: Bar graphs, Histograms, Cumulative histograms

Now we'll play the game with some common graphs. First, we'll start with a bar graph of the mean and the individual observations. Q1 (continued): Keep tracking those scores.

drugvsplacebo('bargraphrange')

Now a "histogram" of the data. In a histogram, the score values are divided into "bins" along the X axis, and the number of scores that fall within each bin is reported on the Y axis. For example:

(Figure: an example histogram, from Wikipedia: histogram)

Q1 (continued): Keep tracking those scores.

drugvsplacebo('histogram')
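The binning idea described above can be sketched by hand. The data and bin edges below are made up for illustration, and the names edges and counts are hypothetical. Matlab has built-in histogram functions, but counting explicitly shows what they compute.

```matlab
% Made-up scores and bin edges (illustrative only)
x = [1 2 2 3 5 6 6 6 9];
edges = [0 2.5 5 7.5 10];            % 4 bins: [0,2.5) [2.5,5) [5,7.5) [7.5,10)
counts = zeros(1, numel(edges)-1);   % one count per bin, all starting at 0
for i = 1:numel(counts)
    % count the scores that are >= the bin's left edge and < its right edge
    counts(i) = sum(x >= edges(i) & x < edges(i+1));
end
counts                               % [3 1 4 1]: number of scores in each bin
```

Typing bar(counts) afterward would draw these counts as a bar chart, which is essentially the histogram.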

Finally, we'll examine a cumulative histogram. This is a graph of the values of the results on the X axis, and what percentile of the data has been accounted for on the Y axis. For example:

This graph shows the distribution of direction index values for visual cortical neurons in 2 populations of ferrets (those with visual experience, and those without). The 50% line (dashed) indicates the median of the 2 distributions. Clearly, experienced animals exhibit higher median direction index values. You can read off any percentile value by looking at the Y axis on the left, and following across to find the corresponding X axis value. (From Li/Van Hooser et al., 2008).

drugvsplacebo('cumulativehistogram')

Q1 (continued): Keep tracking those scores.
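Here is a minimal sketch of how a cumulative histogram can be built: sort the values, then for each value compute the percent of the data at or below it. The data are made up, the variable names are hypothetical, and the i/n convention used here is only one of several in common use.

```matlab
% Made-up values (illustrative only)
x = [4 1 8 2 6];
xs = sort(x);              % sorted values go on the X axis: [1 2 4 6 8]
n = numel(xs);
pct = 100 * (1:n) / n;     % percent of data at or below each value: [20 40 60 80 100]
% Read off the 50% line: the smallest value with at least half the data at or below it
med_est = xs(find(pct >= 50, 1))   % 4, which matches median(x)
```

Typing stairs(xs, pct) or plot(xs, pct) afterward would draw the curve, with the values on the X axis and the percentile on the Y axis.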

Q3: Which graph styles worked well for playing the game, and which were less informative? Why?

Function reference

Matlab functions and operators

  • = (assignment)

  • ; (semicolon, suppresses output)

  • [ ] - matrix enclosure (concatenation)

  • () - element selection

  • pwd

  • ls

  • cd

User functions

  • drugvsplacebo

Copyright 2011-2012, Stephen D. Van Hooser, all rights reserved.