Lab 5.3 High-dimensional space: interpolation; how PCA works
In the last lab (5.2), we used principal component analysis to visualize a high-dimensional space. In this case, that high-dimensional space was defined by melt curves that indicated the fluorescent activity of probes binding to sequences of single-stranded DNA. Each melt curve profile identifies a particular sequence of DNA, and can be used to detect, among other things, which strain of tuberculosis has infected a patient, or which variant of a gene is found in a particular cancer.
In this lab, we tackle a common real-world problem in high-dimensional data analysis: data that were collected on slightly different sampling grids and must be aligned by interpolation before they can be compared.
In the second part of the lab, we will examine principal component analysis in more detail in order to reveal the mechanism by which it finds the directions of maximum variance in a sample.
Aligning data that are sampled at different intervals
Suppose you were working in the Wangh lab, and measured the melt curve in the file mycurve.txt (after normalizing and taking the 2nd derivative):
Now the new postdoc in the lab measures the same curve and says he would like to compare the difference between his melt curve and the melt curve that you measured. The new postdoc's curve is in othercurve.txt. First, let's plot the melt curve that the postdoc recorded:
Do the plots look similar? They do. So far, so good. The new postdoc seems to already know what he is doing in the lab. Now let's calculate the difference between the curves. Does this work?
Why doesn't it work? Let's look at the number of data points:
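As a sketch of this check outside Matlab, here is the same comparison in Python/NumPy. The temperature grids below are made up for illustration, since the contents of mycurve.txt and othercurve.txt are not reproduced here:

```python
import numpy as np

# Hypothetical stand-ins for the two temperature axes: yours was sampled
# about every 1 degree C, the postdoc's about every 0.75 degrees C.
my_temps = np.arange(60.0, 95.0, 1.0)
other_temps = np.arange(60.0, 95.0, 0.75)

print(len(my_temps), len(other_temps))  # different numbers of points!
```

An elementwise subtraction of two curves with these lengths would fail, which is exactly the problem we hit above.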
What?? The postdoc's curve has several more data points. Let's look in more detail. The temperature points that you both ran are the following:
Q1: Are these temperature points the same? Is the same basic range covered?
Rats!! The postdoc collected perfectly reasonable data, but the temperature points don't line up. Either you didn't agree beforehand on which data points should be measured, or, perhaps, the machine doesn't allow you to specify the exact temperature points that are recorded. Either way, it's out of your control now. And why would you want to limit yourself to particular temperature values? It seems a little lame.
So how can we compare these curves?
Fortunately, there is a family of methods, called interpolation, that will allow us to make a reasonable "resampling" of the data: it estimates the values of our data between the actual points that were measured. As long as we have sampled our data reasonably densely, and we assume that the data do not change too much between adjacent points, we can do this without impacting the quality of our data analysis.
You may not realize it, but you are already familiar with a form of interpolation. When Matlab plots lines between data points, it is performing a form of interpolation. Let's re-plot mycurve:
Examine the plot closely. The actual data points are plotted with markers, but Matlab also shows the points connected by lines. That is, it is displaying the values between the points as varying linearly from one data point to the next. There is nothing in your experiment that suggests this is a proper and good thing to do; the lines are fiction, totally made up. Sometimes they help, because they let us visualize a set of discrete points as a curve, but other times they hurt, because they encourage us to imagine what might be going on between the data points when we might not have the proper information.
Q2: Can you see the distinction between the data points and the lines that connect them in a standard Matlab line plot?
Okay, let's make our own interpolation of our data. Let's increase the density of our temperature axis by creating a data point for every 0.5 degrees C instead of roughly every degree. To do this, we will use the Matlab function interp1 (see help interp1).
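The same resampling can be sketched in Python/NumPy, where np.interp plays the role of interp1 with the default linear method (the curve values below are made up for illustration):

```python
import numpy as np

# A made-up melt-curve-like profile sampled once per degree C
temps = np.arange(60.0, 96.0, 1.0)
curve = np.exp(-(temps - 78.0) ** 2 / 20.0)

# Resample onto a denser axis: one point every 0.5 degrees C
dense_temps = np.arange(60.0, 95.5, 0.5)
dense_curve = np.interp(dense_temps, temps, curve)
```

At the original temperatures the interpolated values match the data exactly; the new in-between points sit on the straight lines connecting their neighbors.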
Q3: Do you see the extra data points in the new graph?
There are multiple algorithms for interpolation (see help interp1). You can use "nearest neighbor", which performs no interpolation but selects the value of the closest data point. This method has the advantage that it makes no assumptions about the data between the raw data points. Another common algorithm is the cubic spline, a piecewise polynomial fit that is relatively smooth near the actual data points (see the Wikipedia article).
Let's try cubic splines:
Cubic splines add a bit of curvature between points and for some applications they make sense. In some research settings I have seen splines add very odd twists, so I am usually scared to use anything except linear interpolation.
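A cubic-spline version can be sketched with SciPy's CubicSpline (assuming SciPy is available; in Matlab this corresponds to interp1 with the 'spline' method). The data are again made up:

```python
import numpy as np
from scipy.interpolate import CubicSpline

# The same made-up curve as before, sampled once per degree C
temps = np.arange(60.0, 96.0, 1.0)
curve = np.exp(-(temps - 78.0) ** 2 / 20.0)

spline = CubicSpline(temps, curve)            # piecewise cubic polynomial fit
dense_temps = np.arange(60.0, 95.5, 0.5)
smooth_curve = spline(dense_temps)
```

Like linear interpolation, the spline passes exactly through every measured point; it differs only in how it fills the gaps between them.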
So how can we get our temperature data on a common scale to solve our problem?
Let's pick a common temperature range.
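One way to do this, sketched in Python/NumPy with made-up curves: keep only the temperature range that both experiments cover, resample both curves onto one grid inside that range, and then subtract.

```python
import numpy as np

# Made-up stand-ins for the two experiments (same underlying curve,
# different sampling grids)
my_temps = np.arange(60.0, 95.0, 1.0)
other_temps = np.arange(60.2, 94.8, 0.7)
my_curve = np.exp(-(my_temps - 78.0) ** 2 / 20.0)
other_curve = np.exp(-(other_temps - 78.0) ** 2 / 20.0)

# Common range: from the larger of the two minima to the smaller of the
# two maxima, so we never extrapolate beyond either experiment
lo = max(my_temps.min(), other_temps.min())
hi = min(my_temps.max(), other_temps.max())
common = np.arange(lo, hi, 0.5)

diff = np.interp(common, my_temps, my_curve) \
     - np.interp(common, other_temps, other_curve)
```

Because both made-up curves come from the same underlying profile, the difference is close to zero everywhere.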
How principal component analysis works
In the previous lab (5.2) we saw that one could use principal component analysis to identify the directions of maximum variation in a data set. In this way, we could reduce the 60-dimensional space of the melt curves as a function of temperature to 3-4 dimensions, and we could visualize the similarities and differences among samples from like and unlike strains of tuberculosis.
We presented the method by calling the Matlab function pcacov, and I told you, in a sort of magical fashion, that the purpose of this function is to find the directions of maximum variation. But how does it work, exactly? Principal component analysis is a beautiful and relatively straightforward application of linear algebra, and we will go through it here.
We will use a very simple example to explain how principal component analysis works. We will restrict ourselves to working with 2-dimensional data. Let's artificially generate 3 clusters of data:
The data are divided into 3 clusters, and stretch out along an oblique line that is the direction of maximum variation (artificially made to be about 30 degrees off of the x-axis here). Suppose we didn't know how the data were constructed, but we wanted to find this direction of maximum variation. What could we do?
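A sketch of such a data set in Python/NumPy (the cluster centers, scatter, and the 30-degree angle are all chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.deg2rad(30.0)
axis = np.array([np.cos(theta), np.sin(theta)])   # 30 degrees off the x-axis

# Three clusters whose centers lie along the 30-degree line, each with a
# little isotropic scatter around its center
data = np.vstack([center * axis + rng.normal(scale=0.5, size=(100, 2))
                  for center in (-5.0, 0.0, 5.0)])
```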
One useful quantity to examine is the covariance of the data:
Recall from Lab 2.1 that the covariance measures how much knowing the value of one variable tells you about the other variable. The diagonal entries of the covariance matrix C are the variances of the first and second dimensions, respectively, while the off-diagonal entries are the covariance of dimension 1 and dimension 2.
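As a sketch of this computation in Python/NumPy (with made-up correlated variables; np.cov with rowvar=False plays the role of Matlab's cov):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.8 * x + 0.3 * rng.normal(size=1000)   # y depends partly on x
data = np.column_stack([x, y])

# Each column is treated as one variable, as in Matlab's cov(data)
C = np.cov(data, rowvar=False)
# C[0, 0], C[1, 1]: the variances of x and y
# C[0, 1] == C[1, 0]: the covariance of x and y (positive here,
# because large x tends to go with large y)
```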
The covariance matrix C can also be viewed as a linear transformation, simply because any matrix can be viewed as a linear transformation (as we learned in Lab 5.1). Recall that a linear transformation A can be used to scale/stretch, rotate, or reflect vectors around the origin.
Linear transformations in action
I have written a function that allows one to visualize how a linear transformation operates on vectors. Download the function linear_transform_explorer.m below.
Let's see how the scaling transformation [2 0; 0 1], which scales the first dimension by a factor of 2 and does not scale the second dimension, transforms the vectors around the unit circle. For now, ignore the last 2 plots that are called the eigenvectors.
You can see that this linear transform stretches the unit circle in the first dimension.
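In Python/NumPy, the same stretching can be sketched by applying the matrix to vectors on the unit circle:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 1.0]])   # scale dimension 1 by 2, leave dimension 2 alone

# Unit-circle vectors as the columns of a 2 x 100 array
angles = np.linspace(0.0, 2.0 * np.pi, 100)
circle = np.vstack([np.cos(angles), np.sin(angles)])

stretched = A @ circle   # the circle becomes an ellipse, twice as wide as tall
```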
We can look at another transformation, a reflection:
You can see how the transformation reflects the vectors across the 30 degree line.
These transformations both have "special vectors" that, instead of being rotated or reflected by the transformation, are merely scaled. These vectors are called eigenvectors (from the German eigen, meaning "own" or "characteristic"). The amount by which each eigenvector is scaled is called its eigenvalue.
For example, in the reflection of 30 degrees, the vector [-0.866 -0.5] happens to be on the line of reflection, and is merely scaled (but not reflected) by the transform.
These eigenvectors represent key features of the transformation being applied. A 2-dimensional transformation can have up to 2 linearly independent eigenvectors. The other eigenvector of this reflection is the vector that is perpendicular to the line of reflection:
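A sketch of this reflection in Python/NumPy, using the standard formula for reflection across the line at angle theta (the matrix [cos 2θ, sin 2θ; sin 2θ, -cos 2θ]):

```python
import numpy as np

theta = np.deg2rad(30.0)
# Reflection across the line through the origin at angle theta
R = np.array([[np.cos(2 * theta),  np.sin(2 * theta)],
              [np.sin(2 * theta), -np.cos(2 * theta)]])

on_line = np.array([np.cos(theta), np.sin(theta)])    # ~[0.866, 0.5]
perp = np.array([-np.sin(theta), np.cos(theta)])      # perpendicular to it

# on_line is scaled by +1 (unchanged); perp is scaled by -1 (flipped):
# these are the two eigenvectors, with eigenvalues +1 and -1
```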
We can project the data onto the eigenvectors to "simplify" the linear transformation:
The resulting matrix still represents a reflection (a fundamental operation), but it is now a simpler reflection, across the X axis.
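This projection can be sketched in Python/NumPy: put the two eigenvectors into the columns of a matrix P, and change basis with P' * R * P (written P.T @ R @ P here):

```python
import numpy as np

theta = np.deg2rad(30.0)
R = np.array([[np.cos(2 * theta),  np.sin(2 * theta)],
              [np.sin(2 * theta), -np.cos(2 * theta)]])  # 30-degree reflection

# Columns of P: the eigenvector on the line, and the one perpendicular to it
P = np.column_stack([[np.cos(theta), np.sin(theta)],
                     [-np.sin(theta), np.cos(theta)]])

simpler = P.T @ R @ P   # R expressed in the eigenvector basis
```

The result is diag(1, -1): still a reflection, but now simply across the X axis.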
For a given matrix, how do you find eigenvectors?
This is something that computers do for us in high-dimensional spaces, but it is easy to do by hand for small matrices. The trick is to find vectors that the transform only scales. That is, you solve for:
A * x = lambda * x
where A is a matrix, x is a vector, and lambda is a scalar (just a number).
Equivalently, you solve for
(A - lambda*I) * x = 0

where I is the identity matrix.
If you have a small matrix A = [ a b ; c d ], then this is just solving the following equation for values of x1 and x2 (and the corresponding values of lambda):
(a-lambda)*x1 + b*x2 = 0
c*x1 + (d-lambda)*x2 = 0
subject to the constraint that the eigenvector has length 1 (that is, x1^2 + x2^2 == 1).
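For a concrete 2x2 case: nontrivial solutions exist only when det(A - lambda*I) = 0, which for A = [ a b ; c d ] is the quadratic lambda^2 - (a+d)*lambda + (a*d - b*c) = 0. A sketch in Python/NumPy, with a small symmetric matrix chosen arbitrarily:

```python
import numpy as np

a, b, c, d = 2.0, 1.0, 1.0, 2.0   # an arbitrary symmetric example
A = np.array([[a, b], [c, d]])

# Solve lambda^2 - (a+d)*lambda + (a*d - b*c) = 0 by the quadratic formula
tr, det = a + d, a * d - b * c
disc = np.sqrt(tr ** 2 - 4 * det)
by_hand = sorted([(tr - disc) / 2, (tr + disc) / 2])

# The computer agrees (np.linalg.eigvals is the analogue of Matlab's eig)
numeric = sorted(np.linalg.eigvals(A).real)
```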
Matlab does this very well for arbitrary dimensions with the function eig:
The eigenvectors of the covariance matrix identify the fundamental components of the variation
We can view the covariance matrix as a linear transformation. It is a symmetric matrix, so it is also guaranteed to have real eigenvalues and perpendicular eigenvectors (by the spectral theorem).
Let's take a look at our covariance matrix C as a linear transformation:
Notice that it is a stretching operation, and notice that there is a line that is merely scaled instead of rotated or reflected. This line is exactly the line of maximum variation of the data.
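A sketch of this fact in Python/NumPy, using a covariance matrix constructed by hand to have its largest variance along the 30-degree line (np.linalg.eig is the analogue of Matlab's eig):

```python
import numpy as np

theta = np.deg2rad(30.0)
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])

# Variance 16 along the 30-degree line, 0.25 perpendicular to it
# (numbers chosen arbitrarily for illustration)
C = rot @ np.diag([16.0, 0.25]) @ rot.T

evals, evecs = np.linalg.eig(C)
top = evecs[:, np.argmax(evals)]   # the direction that is merely scaled by C
angle = np.degrees(np.arctan2(top[1], top[0])) % 180.0   # ~30 degrees
```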
We can use these eigenvectors to project the data onto a simpler space:
I'm also going to re-arrange the eigenvectors so that the vector with the largest eigenvalue (diagonal values of D) is first, so that we get the biggest fundamental component first. The function diag returns the diagonal elements of D:
Now the direction of maximum variation is nicely along the X axis, and the direction with the 2nd most variation is nicely along the Y axis. If we had to drop a dimension, we could choose the Y axis knowing that most of the variation in the data is along the X axis. The corresponding eigenvalues of the covariance matrix (in D) show how much relative variation is accounted for by each component:
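The whole pipeline can be sketched in Python/NumPy with the same kind of made-up clustered data: compute the covariance, find its eigenvectors, sort them by eigenvalue, project, and compute the fraction of variance per component.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.deg2rad(30.0)
axis = np.array([np.cos(theta), np.sin(theta)])
data = np.vstack([center * axis + rng.normal(scale=0.5, size=(100, 2))
                  for center in (-5.0, 0.0, 5.0)])   # made-up clusters

C = np.cov(data, rowvar=False)
evals, evecs = np.linalg.eig(C)

order = np.argsort(evals)[::-1]          # largest eigenvalue first
evals, evecs = evals[order], evecs[:, order]

projected = (data - data.mean(axis=0)) @ evecs   # data in eigenvector basis
explained = evals / evals.sum()          # fraction of variance per component
```

For this made-up data set, explained[0] is close to 1: nearly all of the variation lies along the first principal component.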
Q4: How much of the total variation is in the first principal component? The second?
Matlab functions and operators
Copyright 2012 Stephen D. Van Hooser, all rights reserved