In this unit we will explore methods for making sense of high-dimensional data sets. We'll start with a technique called cluster analysis and then move on to visualizing the most meaningful information in high-dimensional data sets. In this lab, we will also look at a real research analysis problem that is in progress. The solution isn't perfectly polished (it's still evolving!) but I hope it will be educational nonetheless.
## Cluster analysis of unclassified samples: K-means

In Lab 5.1 we looked at the famous iris data set of Anderson. Let's imagine that we had collected this data without any knowledge of the flower species, and we wanted to see if the data "naturally" clusters into some groups.
First, let's load and replot the data just like last time using the scatterplot.m file; leave this figure open off to the side.
iris_setosa = load('iris_setosa.txt','-ascii')
iris_versicolor = load('iris_versicolor.txt','-ascii')
iris_virginica = load('iris_virginica.txt','-ascii')
irisdata = [iris_setosa;iris_versicolor;iris_virginica];
figure(100);
% label groups, just like for ANOVA
G = [repmat(1,size(iris_setosa,1),1);
repmat(2,size(iris_versicolor,1),1);
repmat(3,size(iris_virginica,1),1);];
scatterplot(irisdata,'class',G,'markersize',8);
legend('Setosa','Versicolor','Virginica','Location','NorthOutside')
And once again move the legend out of the way and move the figure off to the side.
We can also plot the data without using the knowledge of the species identity. Let's do this now; again, keep the window open when you are finished creating it:
figure(101);
scatterplot(irisdata,'markersize',8);
Q1: Examine the data in Figure 101. Do you see any inherent clusters? Is there one "cloud" of data points that is clearly different from the rest?

There are algorithms that will try to derive the best cluster representation given a set of data points. One of the most classic is K-means clustering, sometimes called Lloyd's algorithm. In K-means clustering, one uses the following algorithm:
- Pick K points randomly from the data to serve as the initial "centroids" of the clusters
- Next, assign each point in the data set to its closest centroid.
- Now, redefine each centroid to be the mean of the points that were assigned to it.
- Repeat steps 2 and 3 as many times as desired.
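The steps above can be sketched directly in MATLAB. This is a minimal illustration on made-up 2-D data with a fixed iteration count; the built-in kmeans function we use below adds smarter initialization and stopping rules:

```matlab
% A minimal K-means loop on toy data (illustration only).
X = [randn(50,2); randn(50,2)+5];        % two well-separated toy clusters
K = 2;
centroids = X(randperm(size(X,1),K),:);  % step 1: pick K random data points
for iter = 1:10                          % step 4: repeat steps 2 and 3
    % step 2: assign each point to its nearest centroid
    D = zeros(size(X,1),K);
    for k = 1:K
        D(:,k) = sum((X - centroids(k,:)).^2, 2);  % squared distances
    end
    [~,idx] = min(D,[],2);
    % step 3: move each centroid to the mean of its assigned points
    for k = 1:K
        centroids(k,:) = mean(X(idx==k,:),1);
    end
end
```

With data this well separated, the two centroids land near (0,0) and (5,5) after only a few iterations.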
In K-means clustering, one has to provide the number of clusters ahead of time, but the algorithm is reasonably good at identifying clusters that are clearly different from one another. Let's try it out:
K = 2;
[idx,c] = kmeans(irisdata,K);
figure(102);
clf; % clear the figure;
scatterplot(irisdata,'class',idx,'markersize',4,'marker','o');
scatterplot(c,'class',[1:size(c,1)]','marker','o','markersize',8,'shrink',0);
In this graph, the small open circles correspond to individual datapoints and the large open circles correspond to the cluster centers.
Q2: How does the clustering algorithm do with K=2? K=3? K=4? Did any of your clustering runs exactly match the "real" division of the data into species (which we happen to have for this data set, but in general one doesn't have)?

Here is a little function that demonstrates the progress of the algorithm (put it in your stats folder in tools):
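The original function file isn't reproduced on this page. A plausible minimal version of what kmeans_over_time might look like is sketched below; it plots only the first two columns of the data, and the 10-iteration limit and half-second pause are assumptions, not the attached file's actual settings:

```matlab
function kmeans_over_time(data, K)
% KMEANS_OVER_TIME  Animate K-means iterations on DATA with K clusters.
% Sketch of the attached file (not the original); plots columns 1 and 2.
centroids = data(randperm(size(data,1),K),:);   % random initial centroids
for iter = 1:10
    D = zeros(size(data,1),K);
    for k = 1:K
        D(:,k) = sum((data - centroids(k,:)).^2, 2);
    end
    [~,idx] = min(D,[],2);                      % nearest-centroid labels
    cla; hold on;
    for k = 1:K
        pts = data(idx==k,:);
        plot(pts(:,1), pts(:,2), 'o', 'MarkerSize', 4);
        centroids(k,:) = mean(pts,1);           % update the centroid
        plot(centroids(k,1), centroids(k,2), 'ko', 'MarkerSize', 10);
    end
    title(sprintf('K-means iteration %d', iter));
    pause(0.5);                                 % watch the centroids move
end
```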
Now let's try it out. (It's easier to see what's going on with K=3 or K=4.)
kmeans_over_time(irisdata,4);
This is all the treatment we'll give to this "unknown category" problem. There are several common clustering algorithms that one can employ; see doc clusterdata for a list of several options and some demos. Clustering is useful in many contexts, including identifying individual neurons recorded at the end of an electrode, interpreting gene microarray data, etc.
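As one example of those options, clusterdata runs hierarchical (linkage-based) clustering in a single line. This is a sketch on made-up data (it requires the Statistics Toolbox); unlike kmeans, it involves no random initialization:

```matlab
% Hierarchical clustering on toy data; cut the tree into 2 clusters.
X = [randn(20,2); randn(20,2)+4];    % two well-separated toy clusters
T = clusterdata(X, 'maxclust', 2);   % T(i) is the cluster label of X(i,:)
```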
## Another example high-dimensional problem: identifying strains of tuberculosis

[this section embargoed pending publication of a paper, we hope to have it free on the web soon]

## Finding the most informative dimensions in a data set: principal component analysis

Let's play a little game to illustrate how we can pay attention to the most informative dimensions. Suppose we had a data set like the following (first download the rot3d.m function attached):

g = rot3d(30*pi/180,1)*rot3d(45*pi/180,2)*...
    [10 0 0 ; 0 4 0 ; 0 0 0.3] * randn(200,3)';
figure;
plot3(g(1,:),g(2,:),g(3,:),'bo')
view(48,38);

Q4: Using the Rotate 3D tool in the figure toolbar, rotate the view; is there a view where the data looks really "thin", meaning there isn't much variation in that dimension? In 3D, does the data look like a "cloud", with volume in all dimensions, or a "disk", lacking volume in one particular dimension?

As you probably saw in Q4, although the data is 3-dimensional, most of the variation in the data is present in only 2 dimensions. It would be great to find a linear transformation that could transform the data so that the most informative dimensions (that is, the dimensions that vary the most) come first. That way, we could focus our attention on these interesting dimensions and, if necessary, leave out the least informative dimensions.

## Principal component analysis

There is an algorithm for determining such a linear transformation called principal component analysis. To perform principal component analysis, we compute the covariance matrix (which we first saw in Lab 2.1). This makes sense; the only data needed to determine a projection onto the most informative dimensions are the variance of each dimension and the covariance of each dimension with the other dimensions.
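The rot3d.m file used above isn't reproduced on this page. A plausible reconstruction is sketched below; the convention that the second argument (1, 2, or 3) selects the rotation axis is an assumption about the attached file, inferred from how it is called above:

```matlab
function r = rot3d(theta, ax)
% ROT3D  3x3 rotation matrix by THETA radians about axis AX (1, 2, or 3).
% Plausible reconstruction of the attached rot3d.m, not the original.
c = cos(theta); s = sin(theta);
switch ax
    case 1, r = [1 0 0; 0 c -s; 0 s c];   % rotate about the x-axis
    case 2, r = [c 0 s; 0 1 0; -s 0 c];   % rotate about the y-axis
    case 3, r = [c -s 0; s c 0; 0 0 1];   % rotate about the z-axis
end
```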
## Computing the principal components (PCs)

V = cov(g') % calculate covariance matrix
% I transposed the data in g so each row would have 1 data point
% the answer V should have dimension 3x3
[comps,factor] = pcacov(V)

The matrix comps is a linear transformation that we can use to project our data onto the most informative dimensions. We'll look at how to do this in just a minute. But, at this point, we can look at how much of the total variance in the data would be explained by each dimension, if we were to project it onto the new space. The variable factor describes how much variance each component explains; the total variance of the data set is given by the sum of the variances of all components, so we can calculate the percentage of the total variance that each component "explains":

totalvariance = sum(factor)
varexplained = 100 * factor ./ totalvariance
figure;
subplot(2,1,1);
bar(varexplained);
ylabel('Percentage variance explained');
xlabel('Dimension number');
box off;
axis([0 3.5 0 100]);
subplot(2,1,2);
bar(cumsum(varexplained));
ylabel('Percentage total variance explained using components 1-N');
xlabel('Dimension number');
box off;
axis([0 3.5 0 100]);

Q5: How much of the total variance in the data is explained by the first component? The second? The third? The first 2 together?

The fact that most of the variance is explained by just a couple of components means we will learn a lot about the underlying structure of the data by projecting it onto fewer dimensions.

## Projecting the data onto the principal component space

Now we can actually create a graphical projection of the data. To do this, we first have to subtract the mean of the data. Because PCA performs its analysis on the covariance matrix, the projection that is developed is with respect to the mean of all of the data.
Then we can perform our linear transformation as normal:

g_submean = g - repmat(mean(g,2),1,size(g,2));
g_proj = comps' * g_submean; % perform the projection
figure;
% plot only the first 2 dimensions
plot(g_proj(1,:),g_proj(2,:),'bo');
xlabel('Principal component 1');
ylabel('Principal component 2');

Q6: Does the 2-dimensional principal component projection resemble the "best" view through the data in 3 dimensions (by "best", I mean the view where you could see the spread of the points the best)?

The projection onto principal components (PCs) is really the projection onto the directions of maximum variance in the space of the data, so it is the view where you can see the spread of the data the best. This principle of dimensionality reduction, of paying attention to the dimensions that matter the most and dropping dimensions that matter less, is the basis of "lossy" compression algorithms such as JPEG compression. You can read more about JPEG compression here.

## Projecting tuberculosis data onto PCs

[embargoed right now pending publication]
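To see the "lossy compression" idea concretely, we can keep only the first 2 PCs and map the data back into 3-D: most of the structure survives. This is a self-contained sketch that uses svd rather than pcacov (an assumed substitution, so it runs without the Statistics Toolbox) on data generated like the example above but without the rotations:

```matlab
% Lossy reconstruction: project onto 2 of 3 PCs, then map back to 3-D.
X  = [10 0 0; 0 4 0; 0 0 0.3] * randn(3,200);  % 3-D data, "thin" 3rd axis
Xm = X - mean(X,2);                            % subtract the mean
[U,~,~] = svd(Xm, 'econ');                     % columns of U are the PCs
X2 = U(:,1:2) * (U(:,1:2)' * Xm);              % keep 2 PCs, reconstruct
err = mean(sum((Xm - X2).^2,1)) / mean(sum(Xm.^2,1));  % fraction lost
```

Because the third axis contributes so little variance, err comes out tiny: throwing away one whole dimension costs almost nothing.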
