Lab 5.1 High-dimensional space: data points as vectors
Over the past 2 decades, new biological tools and additional computing power have allowed scientists to collect and examine more data than ever before. A scientist in Eve Marder's lab here at Brandeis created 20 million different models of a 3-cell pacemaker circuit and discovered that a variety of combinations of cell intrinsic properties and synaptic properties can give rise to circuits that perform the same action (Prinz and Marder, 2004). Scientists in Sacha Nelson's lab used microarrays to examine the RNA expression patterns of 12,556 genes in 12 different cell types in the brain in order to understand which patterns of genes are expressed in different cell types (Sugino et al. 2006). And scientists in Larry Wangh's lab are using innovative polymerase chain reaction (PCR) techniques and pattern recognition methods to identify which DNA sequences are present in diseases such as tuberculosis and cancer.
In this unit we will study vector and matrix representations of data. We have been doing this all along, but in this lab we will look specifically at how we can take advantage of this structure to examine our data. By paying attention to the structure of our data and arranging it in certain ways, we will be able to use tools from linear algebra to apply transformations such as scalings, rotations, reflections, and other interesting transforms. Further, these tools will allow us to analyze very high-dimensional data sets (in Lab 5.2) to pull out the most informative dimensions.
Vector and matrix representations
Let's think a bit more actively about data points as vectors or matrix values. Let's consider the following set of ordered pairs (you can cut and paste):
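A minimal sketch of such a set, with made-up values. The numbers are illustrative only; the variable name v is chosen to match the later references to v and vt in this lab.

```matlab
% Illustrative ordered pairs (made-up values, not measured data).
% Each row is one (x, y) point.
v = [ 1 2 ;
      2 3 ;
      3 5 ;
      4 4 ;
      5 7 ];
size(v)   % number of rows (points), then number of columns (dimensions)
```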
Here, each row is a different ordered pair with 2 dimensions. We can plot these points:
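One way the plot might look, as a sketch that assumes the made-up example values below for v: column 1 holds the x values and column 2 holds the y values.

```matlab
% Illustrative ordered pairs (made-up values), one point per row
v = [1 2; 2 3; 3 5; 4 4; 5 7];

figure;
plot(v(:,1), v(:,2), 'o');   % x is column 1, y is column 2
xlabel('x');
ylabel('y');
```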
We could also represent the same data in column form, using 1 data point per column, using the transpose operator (a single apostrophe '):
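A short sketch of the column form, again assuming made-up example values for v. After the transpose, row 1 holds the x values and row 2 holds the y values.

```matlab
% Illustrative ordered pairs (made-up values), one point per row
v = [1 2; 2 3; 3 5; 4 4; 5 7];

vt = v';                       % now one point per column
figure;
plot(vt(1,:), vt(2,:), 'o');   % x is row 1, y is row 2
xlabel('x');
ylabel('y');
```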
We can also re-arrange dimensions when the dimension of the matrix is larger than 2 using the permute function, which is the N-dimensional generalization of the transpose operation:
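As a small sketch (the array A below is an arbitrary example), permute takes the array and a list giving the new order of the dimensions:

```matlab
% An arbitrary 2 x 3 x 4 array, just for illustration
A = reshape(1:24, [2 3 4]);

% Swap the first two dimensions, leave the third alone
B = permute(A, [2 1 3]);
size(B)

% For an ordinary 2-D matrix, permute with [2 1] matches the transpose
M = [1 2 3; 4 5 6];
isequal(permute(M, [2 1]), M')
```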
Performing common operations across a dimension
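Many Matlab functions, such as mean and sum, take an optional argument that says which dimension to operate across. A sketch, assuming the made-up example values below for v (try each line and look at the size of the result):

```matlab
% Illustrative ordered pairs (made-up values), one point per row
v = [1 2; 2 3; 3 5; 4 4; 5 7];
vt = v';          % the same points, one per column

m1 = mean(v, 1)   % operate across dimension 1 (down the rows)
m2 = mean(v, 2)   % operate across dimension 2 (across the columns)
m3 = mean(vt, 1)
s1 = sum(v, 1)    % sum accepts the same dimension argument
```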
Q1: How does mean(v,2) differ in dimension from mean(v,1)? How does mean(vt,1) differ in arrangement from mean(v,2)?
Linear transformations and Affine transformations
When we have our data in the form of a vector or a matrix, it is very easy to perform linear transformations or affine transformations on the data. A linear transformation (also known as a linear projection or a linear mapping) is a change-of-coordinates with a linear function. Suppose we have a set of ordered pairs (xi, yi). We can perform a linear transformation by using the equations:
xi* = a*xi + b*yi
yi* = c*xi + d*yi
The act of performing the transform is usually noted mathematically as T(x,y).
A linear transformation has to obey the following rules:
T(x1+x2,y1+y2) = T(x1,y1) + T(x2,y2) (additivity)
T(a*x1, a*y1) = a*T(x1,y1) (homogeneity)
An affine transformation is related to a linear transformation except that translations are allowed before or after the linear transformation:
xi* = a*xi + b*yi + e
yi* = c*xi + d*yi + f
When we write our data points in the form of column vectors, so that each column corresponds to a new data point (as the variable vt above), then we can express linear transformations very easily in the form of matrix multiplication:
A = [ a b ; c d]
then by matrix multiplication
A * [ xi ; yi ] = [ a*xi + b*yi ; c*xi + d*yi ]
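To see that the matrix product really matches the two equations, here is a small check. The values of a, b, c, d and the data points are made up for illustration.

```matlab
% Made-up coefficients: a = 2, b = 1, c = 0, d = 3
A = [2 1; 0 3];

% Made-up data, one point per column (row 1 is x, row 2 is y)
vt = [1 2 3 4 5;
      2 3 5 4 7];

% Transform every point at once with one matrix multiplication
vt_new = A * vt;

% The same thing written out with the two equations
x_new = 2*vt(1,:) + 1*vt(2,:);
y_new = 0*vt(1,:) + 3*vt(2,:);

isequal(vt_new, [x_new; y_new])
```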
Here we'll examine several common and useful affine and linear transformations:
The first common operation we often want to perform is to translate (or shift) all of our points by a particular value. If we are lucky enough that we want to shift all dimensions in the same way, then we can simply write:
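A sketch of the simple case, assuming made-up example points in vt and a shift of 5 in every dimension:

```matlab
% Made-up data, one point per column
vt = [1 2 3 4 5;
      2 3 5 4 7];

figure;
plot(vt(1,:), vt(2,:), 'bo');              % original points in blue
hold on;

vt_shift = vt + 5;                         % add 5 to every entry
plot(vt_shift(1,:), vt_shift(2,:), 'ro');  % shifted points in red
```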
(Leave this figure open.)
But often we are not so lucky. We might want to shift each dimension in a different way. In this case, we have to build our own matrix to subtract using the repmat command that we first saw in Lab 2.3.
Suppose we want to shift the points by 5 in x, and -5 in y. Then we can create a matrix that is the same size as vt:
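A sketch of this shift, assuming made-up example points in vt. The names shift and shift_matrix are just illustrative choices.

```matlab
% Made-up data, one point per column
vt = [1 2 3 4 5;
      2 3 5 4 7];

shift = [5; -5];                               % +5 in x, -5 in y
shift_matrix = repmat(shift, 1, size(vt, 2));  % one copy per data point
vt_shift2 = vt + shift_matrix;

hold on;                                       % draw on the current figure
plot(vt_shift2(1,:), vt_shift2(2,:), 'go');    % shifted points in green
```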
(Again, leave the figure open.)
One common translation we want to make with experimental data is to remove the mean. The usefulness of this will be apparent very soon, when we look at scaling and rotation:
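A sketch of mean removal, assuming made-up example points in vt. We take the mean across the data points (dimension 2) and subtract a copy of it from every column.

```matlab
% Made-up data, one point per column
vt = [1 2 3 4 5;
      2 3 5 4 7];

mn = mean(vt, 2);                           % 2 x 1: mean of x and mean of y
vt_centered = vt - repmat(mn, 1, size(vt, 2));

mean(vt_centered, 2)                        % should now be (very close to) zero

hold on;
plot(vt_centered(1,:), vt_centered(2,:), 'ko');   % centered points in black
```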
How the heck does one remember which is rows and columns?
Okay, you ask, I see that repmat can be used to create a matrix that is the same size as my data matrix, but I know I'm never going to remember which way the rows and columns go. For example, I'm sure I'll accidentally type
which will make my matrix be the wrong thing (try it).
How do I remember?
The answer is I don't. Instead, every time I use repmat (and just now as I was preparing this exercise), I write a little code on the command line to check which way it goes. If you get fast at writing a little code to check which way it goes, you'll be accurate and you don't have to remember.
Here's what I did for this exercise.
I knew my shift should be a single column, because the data in vt is organized in columns. So my column shifts are
Now how do I make sure it has 1 row and N columns, where N is the number of columns of vt? I checked which way the size function went first:
Then I said "Oh, that's right, size(vt,1) is the number of rows, and size(vt,2) is the number of columns. I never remember which way it goes but I'll remember for the next 5 minutes."
Then I wrote a little code both ways, and looked at the answer:
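The kind of quick check described here might look like the following sketch, which assumes made-up example points in vt and a shift of 5 in x and -5 in y:

```matlab
% Made-up data, one point per column
vt = [1 2 3 4 5;
      2 3 5 4 7];

shift = [5; -5];          % a single column, like one data point

size(vt)                  % rows first, then columns

% Try repmat both ways and compare the sizes with size(vt)
try1 = repmat(shift, 1, size(vt, 2));
size(try1)                % same size as vt, so this is the one we want

try2 = repmat(shift, size(vt, 2), 1);
size(try2)                % one long column, not what we want
```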
And that's how I figure out which is rows and which is columns. Maybe some people out there have it memorized, I certainly don't.
The same applies for mean, sum, etc ("do I want mean(v) or mean(v,2)?")
Scaling is a very simple linear transformation we often want to apply to the data.
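A sketch of uniform scaling, assuming made-up example points in vt and a scale factor of 2 (written as a matrix so it matches the form A*vt used above):

```matlab
% Made-up data, one point per column
vt = [1 2 3 4 5;
      2 3 5 4 7];

S = [2 0; 0 2];          % scale both x and y by 2
vt_scaled = S * vt;      % same result as 2 * vt

figure;
plot(vt(1,:), vt(2,:), 'bo');
hold on;
plot(vt_scaled(1,:), vt_scaled(2,:), 'ro');
```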
This scales all of the points. We can also scale all the points but remove the mean first, so that we scale with respect to the center of the points:
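A sketch of scaling about the center, assuming made-up example points in vt and a scale factor of 2:

```matlab
% Made-up data, one point per column
vt = [1 2 3 4 5;
      2 3 5 4 7];

mn_matrix = repmat(mean(vt, 2), 1, size(vt, 2));
vt_centered = vt - mn_matrix;                % remove the mean first

vt_scaled_c = [2 0; 0 2] * vt_centered;      % then scale by 2

figure;
plot(vt_centered(1,:), vt_centered(2,:), 'bo');
hold on;
plot(vt_scaled_c(1,:), vt_scaled_c(2,:), 'ro');
```

Because the centered points have zero mean, the scaled cloud stays centered on the origin and simply spreads out.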
This again scales all of the points.
Sometimes we might want to scale the dimensions differently. For example, we might want to scale x by 5 but only scale y by 2. To do this, we can multiply by the matrix [5 0; 0 2]. Think about the matrix multiplication; maybe write it out on a sheet of paper.
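A quick check of this multiplication, assuming made-up example points in vt:

```matlab
% Made-up data, one point per column
vt = [1 2 3 4 5;
      2 3 5 4 7];

S = [5 0; 0 2];           % scale x by 5, y by 2
vt_xy = S * vt;

% Row 1 (x) is multiplied by 5, row 2 (y) by 2
isequal(vt_xy(1,:), 5 * vt(1,:))
isequal(vt_xy(2,:), 2 * vt(2,:))
```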
Rotation by an angle is also a linear transformation. The rotation matrix in 2-D for an angle theta in radians is as follows:
[cos(theta) -sin(theta) ; sin(theta) cos(theta) ]
For this transformation, I have written a function (because I can never remember where the negative sign goes and which ones are cosines and which are sines) (save in a folder in tools called math, be sure to add math to your path):
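A minimal sketch of what such a function might look like. The name rot2d is a hypothetical choice for this example; the body is just the rotation matrix given above.

```matlab
function R = rot2d(theta)
% ROT2D - 2-D rotation matrix (hypothetical helper name)
%
%   R = ROT2D(THETA) returns the 2x2 matrix that rotates column
%   vectors counterclockwise about the origin by THETA radians.

R = [ cos(theta) -sin(theta) ;
      sin(theta)  cos(theta) ];
end
```

With this file on the path, rot2d(pi/2)*[1;0] should give a vector very close to [0;1].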
The rotation is around the origin, so there is a big difference whether or not you rotate the points in place, or first translate the points to be centered on the origin:
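A sketch of the two cases, assuming made-up example points in vt and an angle of 45 degrees. The rotation matrix is written out inline so the example runs on its own.

```matlab
% Made-up data, one point per column
vt = [1 2 3 4 5;
      2 3 5 4 7];

theta = pi/4;
R = [cos(theta) -sin(theta); sin(theta) cos(theta)];

% Case 1: rotate the raw points; the whole cloud swings around the origin
vt_rot = R * vt;

% Case 2: center the points on the origin first, then rotate
mn_matrix = repmat(mean(vt, 2), 1, size(vt, 2));
vt_rot_c = R * (vt - mn_matrix);

figure;
plot(vt(1,:), vt(2,:), 'bo');
hold on;
plot(vt_rot(1,:), vt_rot(2,:), 'ro');
plot(vt_rot_c(1,:), vt_rot_c(2,:), 'go');
axis equal;
```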
We can create a fun animation to rotate this around the clock.
The last major geometrical linear transformation is reflection about a line through the origin that makes an angle theta with the x axis:
The matrix for a 2-dimensional reflection about theta is:
[cos(2* theta) sin(2* theta) ; sin(2* theta) -cos(2* theta) ];
Once again, I've written a function for this transformation (save in the folder in tools called math):
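A minimal sketch of such a function. The name refl2d is a hypothetical choice for this example; the body is the reflection matrix given above.

```matlab
function R = refl2d(theta)
% REFL2D - 2-D reflection matrix (hypothetical helper name)
%
%   R = REFL2D(THETA) returns the 2x2 matrix that reflects column
%   vectors about the line through the origin at angle THETA (radians).

R = [ cos(2*theta)  sin(2*theta) ;
      sin(2*theta) -cos(2*theta) ];
end
```

As a sanity check, refl2d(0) reflects about the x axis, so refl2d(0)*[1;1] should give [1;-1], and applying any reflection twice should return the original points.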
We can explore this transformation in the same way we did for rotation:
And again we can create a fun animation to reflect this around the clock.
Compositions of transformations
A beautiful thing about linear transformations is that they can be composed super easily: it's just matrix multiplication. Here's an example of a scaling, rotation, and reflection in 1 line:
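A sketch of such a composition, assuming made-up example points in vt and arbitrary example parameters. Note the order: the matrix written closest to vt is applied first.

```matlab
% Made-up data, one point per column
vt = [1 2 3 4 5;
      2 3 5 4 7];

S = [5 0; 0 2];                                     % scale x by 5, y by 2
th = pi/2;                                          % rotate by 90 degrees
R = [cos(th) -sin(th); sin(th) cos(th)];
ph = 0;                                             % reflect about the x axis
F = [cos(2*ph) sin(2*ph); sin(2*ph) -cos(2*ph)];

% Scale, then rotate, then reflect, in 1 line
vt_new = F * R * S * vt;

% Same answer as applying the steps one at a time
vt_steps = F * (R * (S * vt));
max(max(abs(vt_new - vt_steps)))                    % should be essentially zero
```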
Now that we are armed with these linear and affine transformations (which have many many applications) we will be equipped for dimensionality reduction in Lab 5.2.
Let's now turn to looking at higher dimensional spaces (that is, when the dimension is greater than 2).
Plotting in higher dimensional spaces
It is quite common in science to generate data that has 3 or more dimensions, but making sense out of such data and showing the results to others presents special challenges.
Continuously sampled data (such as time series); surfaces
Consider the zebra finch song that we analyzed in Lab 3.2 (attached below). The spectrogram of the zebra finch song is essentially a 3-d piece of information; at each point in time (1), we have a power/intensity value (2) at each frequency (3).
(Leave this figure open.) The spectrogram method demonstrates a powerful way to display 3d information by essentially reducing it down to 2 dimensions in the form of an image (using color to code for power). One could look at a spectrogram in a paper and interpret it easily.
The spectrogram function actually plots a surface, which is a 3-dimensional structure. To illustrate how to do this in Matlab, let's provide output arguments to the spectrogram function, so we can get the data and plot it ourselves using the same functions that spectrogram calls.
The Matlab function that plots surfaces is called surf (see help surf).
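A sketch of the idea, with two assumptions: a synthetic frequency sweep stands in for the zebra finch song (so the example runs without the song file), and the window, overlap, and FFT sizes are arbitrary choices. The spectrogram function requires the Signal Processing Toolbox.

```matlab
% Synthetic stand-in signal: a 1-second sweep from 500 Hz up to 2500 Hz
fs = 8000;                                   % sampling rate in Hz (arbitrary)
t = 0:1/fs:1-1/fs;
x = sin(2*pi*(500*t + 0.5*2000*t.^2));

% With output arguments, spectrogram returns the data instead of plotting
% (window length 256, overlap 128, 256-point FFT are arbitrary choices)
[S, F, T, P] = spectrogram(x, 256, 128, 256, fs);

% Plot power (in dB) as a surface over time and frequency
figure;
surf(T, F, 10*log10(P + eps), 'EdgeColor', 'none');
axis tight;
xlabel('Time (s)');
ylabel('Frequency (Hz)');
zlabel('Power (dB)');
```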
Q2: Look at the 3D view. You can use the "Rotate 3-D" tool (the box with the circular arrow around it). Which view do you prefer, the 2-D view or the 3-D view? If you were going to put an image in a paper, which do you think would be easier for readers to digest?
It is possible to adjust the angle of view on the command line using the view command (see help view):
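A short sketch using Matlab's built-in peaks example surface as stand-in data. The view command takes an azimuth and an elevation, both in degrees.

```matlab
figure;
surf(peaks(30));      % built-in example surface

view(0, 90);          % look straight down: a 2-D view
pause(1);             % brief pause so you can see the change
view(-37.5, 30);      % Matlab's default 3-D viewing angle

[az, el] = view       % read back the current azimuth and elevation
```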
There are a lot of options in Matlab for viewing data in 3-D. A 3-d surface plot can be combined with a contour plot:
or to use another bit of example data (derived from the peaks function in Matlab, which is an example surface for demoing the surface commands):
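A sketch with the peaks data, using surfc, which draws a surface with a contour plot underneath it (the grid size of 30 is an arbitrary choice):

```matlab
[X, Y, Z] = peaks(30);    % 30 x 30 example surface

figure;
surfc(X, Y, Z);           % surface plus contour plot below it
xlabel('x');
ylabel('y');
zlabel('z');
```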
High-dimensional data points
Finally, often we have individual data points with many parameters. As an example, there is a famous set of data of 3 different Iris species (Iris setosa, Iris versicolor, and Iris virginica) that was collected by Edgar Anderson and made famous by RA Fisher, who used the data to demonstrate linear discriminant analysis. (Check the link to see pictures of the 3 flower species.) Anderson measured 4 parameters of several flowers of each species: sepal length, sepal width, petal length, and petal width. Therefore, each flower is a 4-dimensional data point.
How should we plot this data?
We could use the built-in function plot3 (see help plot3) and examine 3 dimensions at a time.
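A sketch of one such 3-dimensional view, using Matlab's built-in plot3. It assumes the copy of the iris data that ships with the Statistics and Machine Learning Toolbox (load fisheriris, which provides meas and species) rather than any file attached to this lab, and it plots the first 3 of the 4 dimensions.

```matlab
% Assumes the Statistics and Machine Learning Toolbox is installed
load fisheriris           % meas: 150 x 4 measurements, species: 150 x 1 names

names = unique(species);  % the 3 species names
colors = 'rgb';

figure;
hold on;
for i = 1:numel(names)
    idx = strcmp(species, names{i});     % rows belonging to this species
    plot3(meas(idx,1), meas(idx,2), meas(idx,3), [colors(i) 'o']);
end
view(3);                  % switch to a 3-D viewing angle
grid on;
xlabel('Sepal length');
ylabel('Sepal width');
zlabel('Petal length');
legend(names);
```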
Q3: How do these 3 species differ? (You might need to use the Rotate 3-D tool to appreciate the different parameters.)
But, you complain, this is lame, we are leaving out one of the dimensions of the data. How can we represent all data points at once?
We can do this with a multidimensional scatter plot of the data. We first encountered scatter plots of 2-dimensions in Lab 2.1. Recall we can plot
To perform an N dimensional scatterplot in a Matlab figure, I use a routine written by a former colleague Maneesh Sahani called scatterplot.m. It relies on another function called assign.m. Both are attached at the end of this lab. Place scatterplot.m in the plotting folder inside your tools folder.
Take your mouse and move the legend out of the way so all the plots can be full size.
The scatterplot plots every dimension against every other dimension. The upper left graph has dimension 2 (sepal width, on the y axis) plotted against dimension 1 (sepal length, on the x axis). If you look down the left hand column of graphs, you see that dimension 1 (sepal length) is on the x axis of every graph in that column, allowing you to see how the other dimensions (2, 3, and 4) co-vary with dimension 1. In the second column, dimension 2 is on the x axis, and dimensions 3 and 4 are on the y axis. By comparing the points across the graph rows and columns, you can get a sense of how the data looks in multi-dimensional space.
Q4: Describe how Iris setosa differs from Iris versicolor and Iris virginica in all 4 dimensions. How do Iris versicolor and Iris virginica differ (in all 4 dimensions)?
Matlab functions and operators