Lab 5.2 High-dimensional space: clustering and dimensionality reduction

Introduction

In this unit we will explore methods for making sense of high-dimensional data sets. We'll start with a technique called cluster analysis and then move on to visualizing the most meaningful information in high-dimensional data sets.

In this lab, we will also look at a real research analysis problem that is in progress. The solution isn't perfectly polished (it's still evolving!) but I hope it will be educational nonetheless.

Cluster analysis of unclassified samples: K-means

In Lab 5.1 we looked at the famous iris data set of Anderson. Let's imagine that we had collected this data without any knowledge of the flower species, and we wanted to see if the data "naturally" clusters into some groups.

First, let's load and replot the data just like last time using the scatterplot.m file; leave this figure open off to the side.

iris_setosa = load('iris_setosa.txt','-ascii')

iris_versicolor = load('iris_versicolor.txt','-ascii')

iris_virginica = load('iris_virginica.txt','-ascii')

irisdata = [iris_setosa;iris_versicolor;iris_virginica];

figure(100);

% label groups, just like for ANOVA

G = [repmat(1,size(iris_setosa,1),1);

repmat(2,size(iris_versicolor,1),1);

repmat(3,size(iris_virginica,1),1);];

dasw.plot.scatterplot(irisdata,'class',G,'markersize',8);

legend('Setosa','Versicolor','Virginica','Location','NorthOutside')

And once again move the legend out of the way and move the figure off to the side.

We can also plot the data without using the knowledge of the species identity. Let's do this now; again, keep the window open when you are finished creating it:

figure(101);

dasw.plot.scatterplot(irisdata,'markersize',8);

Q1: Examine the data in Figure 101. Do you see any inherent clusters? Is there one "cloud" of data points that is clearly different from the rest?

There are algorithms that will try to derive the best cluster representation for a given set of data points. One of the most classic is K-means clustering, sometimes called Lloyd's Algorithm. In K-means clustering, one uses the following algorithm:

  1. Pick K points randomly from the data to be the "centroids" of the clusters.

  2. Next, assign each point in the data set to the centroid it is closest to.

  3. Now, redefine each centroid to be equal to the mean of the points that were closest to it.

  4. Repeat steps 2 and 3 as many times as desired (or until the assignments stop changing); a minimal sketch of these steps in code appears below.
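
If you are curious how these steps look in code, here is a minimal sketch written directly in Matlab (for illustration only; the value of K and the number of iterations are arbitrary choices here, and in practice we will simply call Matlab's built-in kmeans function, as we do next):

% A minimal sketch of the K-means steps above (illustration only)
K = 3; % number of clusters, chosen arbitrarily
r = randperm(size(irisdata,1));
centroids = irisdata(r(1:K),:);                 % step 1: K random data points
assignments = zeros(size(irisdata,1),1);
for iter=1:10,                                  % step 4: repeat several times
    for i=1:size(irisdata,1),                   % step 2: find nearest centroid
        d = sum((centroids - repmat(irisdata(i,:),K,1)).^2,2);
        [dummy,assignments(i)] = min(d);
    end;
    for k=1:K,                                  % step 3: recompute each centroid
        if any(assignments==k),
            centroids(k,:) = mean(irisdata(assignments==k,:),1);
        end;
    end;
end;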

In K-means clustering, one has to provide the number of clusters ahead of time, but the algorithm is reasonably good at identifying clusters that are clearly different from one another. Let's try it out:

K = 2;

[idx,c] = kmeans(irisdata,K);

figure(102);

clf; % clear the figure;

dasw.plot.scatterplot(irisdata,'class',idx,'markersize',4,'marker','o');

dasw.plot.scatterplot(c,'class',[1:size(c,1)]','marker','o','markersize',8,'shrink',0);

In this graph, the small open circles correspond to individual datapoints and the large open circles correspond to the cluster centers.

Q2: How does the clustering algorithm do with K=2? K=3? K=4? Did any of your clustering runs exactly match the "real" division of the data into species (which we happen to have for this data set, but in general one doesn't have)?

Here is a little function that demonstrates the progress of the algorithm (put it in your +dasw/+stats package):

kmeans_over_time.m

function [indx,c] = kmeans_over_time(data, K)

% KMEANS_OVER_TIME

%

% [IDX, C] = dasw.stats.kmeans_over_time(DATA,K)

%

% Shows the progress of the K-means algorithm over each iteration

% for 15 iterations.

%

% DATA and K should be arguments that could be fed to KMEANS.

%

% See also: KMEANS

% pick K distinct points at random

r = randperm(size(data,1));

r = r(1:K);

f = figure;

for t=1:15,

[indx,c]=kmeans(data,K,'Start',data(r,:),'MaxIter',t);

figure(f);

clf;

dasw.plot.scatterplot(data,'class',indx,'markersize',4,'marker','o');

dasw.plot.scatterplot(c,'class',[1:size(c,1)]','marker','o','markersize',8,'shrink',0);

pause(1);

end;

Now let's try it out. (It's easier to see what's going on with K=3 or K=4.)

dasw.stats.kmeans_over_time(irisdata,4);

This is all the treatment we'll give to this "unknown category" problem. There are several common clustering algorithms that one can employ; see doc clusterdata for a list of several options and some demos. Clustering is useful in many settings, including identifying individual neurons in extracellular electrode recordings (spike sorting), interpreting gene microarray data, and so on.
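
For example, the clusterdata function will perform a hierarchical clustering in a single call. Here is a quick sketch on the iris data (it assumes the Statistics Toolbox, which also provides kmeans):

T = clusterdata(irisdata,'maxclust',3);   % hierarchical clustering into 3 groups
figure;
dasw.plot.scatterplot(irisdata,'class',T,'markersize',8);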

Another example high-dimensional problem: identifying strains of tuberculosis

Tuberculosis is an infectious disease caused by a subclass of bacteria called mycobacteria. While about 75% of cases are asymptomatic, about 10% of cases progress to active disease; about half of these active disease cases would cause death without antibiotics.

Unfortunately, many mutant mycobacterial strains of tuberculosis are resistant to particular antibiotics. So, when treating a patient, it is important to know which strain of mycobacteria has infected a patient. These efforts are complicated by the fact that most TB cases are found in developing countries, where access to gene sequencing techniques is uncommon.

Larry Wangh's lab at Brandeis has developed methods for distinguishing these species with very small samples of DNA. The idea is that a patient would cough to provide a sample of phlegm, and the sample would be placed into a small tube with some marker proteins and DNA primers. Next, PCR would be performed to amplify the bacterial DNA. At this point, there would be a bunch of bacterial DNA and some protein markers in the tube.

The protein markers are designed as an ON/OFF pair. Initially, at low temperatures, both markers wrap around on themselves in loops (see figure, panel A). One marker, the ON marker (orange in the figure), has an attached molecule that is capable of becoming fluorescent, but it is held in close physical proximity to a "quenching" molecule that prevents the fluorescence. The OFF marker has a quenching molecule as well.

The markers are designed so that they have nucleotides that are complementary to the DNA of interest. If the temperature is increased (as in panel B), there will be a critical temperature at which it is more stable for the ON marker protein to unfold and attach to the DNA. The exact temperature depends on the quality of the match between the nucleotides on the DNA and the marker protein, as well as on the exact sizes of the marker proteins (which differ from application to application; the sizes in the figure were chosen arbitrarily). When the ON marker protein unfolds, the quenching molecule is moved away from the fluorescent molecule, which can now express its fluorescence and be observed by a detector.

As the temperature is increased further (panel C), it becomes chemically stable for the OFF marker to unfold and bind to a second sequence that is in close proximity to the sequence that is recognized by the ON marker. If this occurs, then once again the fluorescent molecule is quenched and the fluorescent signal is reduced.

A Brandeis undergraduate, Skye Fishbein, has worked with Wangh's lab to develop markers for distinguishing the most deadly strains of mycobacteria that cause tuberculosis: Mycobacterium tuberculosis, Mycobacterium bovis, Mycobacterium microti, and Mycobacterium africanum.

The resulting fluorescence as a function of temperature is called a "melt curve"; each mycobacterial species has a different melt curve signature. We'll load some functions from Skye's Matlab library (developed in collaboration with Steve) to plot these curves.

Download and install the package fishbein_library.zip. Make sure to add the functions to your path by running addpath(genpath(['fishbein_library'])) from the directory that contains the library.

Note: This is an in-progress research library, so here is fair warning that 1) some things might not be fully functional or correct at this point, and 2) some of the functions might represent ideas that we've dropped and are no longer actively developing. Also, these functions are not organized in a Matlab package; they are just a collection of files.

Let's read in an example run where Skye looked at the melt curves of many bacterial species with her markers. In each tube, she put a little DNA from a different bacterial species (with some repetition), along with the same set of ON and OFF markers that she designed. Although only one set of markers is depicted in the figure above, Skye used several markers of the same color for this assay.

standards = read_standard_data('datasets/SF110627.txt');

The structure returned by read_standard_data is a list of all the data in her run.

If you type

standards(1)

you'll access the first element, which happens to correspond to the tubes that contained M. africanum. This data type is a structure, just like the type of output returned by the Matlab function dir. The strain for each group is a string:

standards(1).strain
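
To see every strain included in this run, you can loop over the structure array (a quick sketch; it assumes each .strain field is a character string, as shown above):

for i=1:numel(standards),
    disp([int2str(i) ': ' standards(i).strain]);
end;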

The fluorescent melt curve data from the tubes that had M. africanum is in standards(1).data. It is stored in a data type that we haven't seen yet, called a cell.


The cell data type

In Matlab, most variables are stored as matrix values. (In fact, before Matlab version 5 in the mid-1990s, all variables were matrices.) This is great because we can bring all sorts of math to bear on our data very quickly.

However, storing everything as a matrix also has drawbacks. What if you want to store lists of numbers, where the number of elements might be different?

For example, suppose we wanted to store time series measurements of blood glucose for several subjects over the hours after a meal (one row for the time points, one row for the blood glucose values), but we couldn't be sure that each subject would have the same number of measurements. We could use separate variables, like this:

subject1 = [ 0 3 6 9 12 ; 180 100 80 70 60]

subject2 = [ 0 4 8 12; 180 95 82 75]

But if we had many subjects, it would be inconvenient to write a program that looped over so many different variable names. Could we make this a matrix? Not really; try it:

subjects = [subject1 ; subject2]

What error do you get? The matrices simply don't fit together because they don't have the same number of columns.

Enter the cell data type. Cells are a little like matrices, except:

  1. The elements are accessed by curly braces, like a{i} or a{i,j}

  2. Different elements within a cell don't have to be the same size or even the same data type

  3. You can't do math directly on a cell; you have to pull out the data first, and then do math on the data (we'll see a small example of this below).

Example:

subjects = {};

subjects{1} = subject1;

subjects{2} = subject2;

subjects{1}

subjects{2}

Another example:

mycell = {};

mycell{1} = 'This is a string';

mycell{2} = 12345;

mycell

Cells are a little like a spreadsheet; you can put anything you want in there, and pull it out later. They are really great for organizing information that has odd sizes or different types.
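
As a quick illustration of point 3 above, here is how one might pull the data back out of the subjects cell array from the glucose example and do some math on it (this assumes the layout we used above: row 1 holds time points, row 2 holds glucose values):

for i=1:length(subjects),
    data_i = subjects{i};                    % pull the matrix out of the cell
    final_glucose = data_i(2,end);           % row 2 holds the glucose values
    disp(['Subject ' int2str(i) ' final glucose: ' num2str(final_glucose)]);
end;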

Here, we used cell arrays to store the data since, in principle, each example might have different temperature values or different numbers of points. She ran 3 separate tubes with M. africanum as you can see by looking at the size of standards(1).data:

standards(1).data

Let's look at one example melt curve:

standards(1).data{1}

We have written a function to plot melt curves. Let's take a look at the signatures of M. africanum (the 1st standard) and M. bovis (the 8th standard):

figure;

plot_standard(standards([1 8]),3)

title('Green is M. africanum, Red is M. bovis');

xlabel('Temperature');

ylabel('Fluorescence (a.u.)');

You might be able to appreciate that there is a gradual inherent increase in fluorescence with temperature, and along the way you can see several inflections in the curve. These inflections correspond to temperatures at which ON or OFF markers are binding to the DNA of interest.

Members of Larry's lab have discovered that these inflection points can be highlighted by examining the derivative of the data rather than the raw data itself; further, Skye and I discovered that the 2nd derivative identifies these inflection points rather consistently across runs. Before taking the 2nd derivative, we normalize by making the first point 0 and the last point 1:

normstandards1 = modify_standards(standards,'normalizeNdim_01');

normstandards2 = modify_standards(normstandards1,'normalize_by_lastvalue');

normstandards3 = modify_standards(normstandards2,'meltcurveNdim_2derivative');

figure;

plot_standard(normstandards3([1 8]),3)

title('Green is M. africanum, Red is M. bovis');

xlabel('Temperature');

ylabel('Fluorescence (a.u.)');

Q3: The thin lines in the plot indicate individual tubes, and the thick line indicates the mean of all tubes for that bacterial species. Do the individual runs look very similar to the mean, or is there a lot of variance? If you were looking by eye, could you distinguish M. bovis from M. africanum?
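
For some intuition about what these transformations are doing, here is a rough sketch of the idea applied to a single raw melt curve. This is not the library's exact implementation (that is handled by modify_standards); it assumes each curve is stored as a 2-row matrix with temperature in row 1 and fluorescence in row 2, as we saw in standards(1).data{1}:

curve = standards(1).data{1};        % row 1: temperature, row 2: fluorescence
T = curve(1,:);
F = curve(2,:);
F = F - F(1);                        % make the first point 0
F = F / F(end);                      % make the last point 1
d2F = diff(F,2);                     % approximate discrete 2nd derivative
figure;
plot(T(2:end-1),d2F,'b-');
xlabel('Temperature');
ylabel('Approx. 2nd derivative of normalized fluorescence');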

Now let's plot the whole data set:

figure(105);

plot_standard(normstandards3,3)

xlabel('Temperature');

ylabel('Fluorescence (a.u.)');

You can see there are a number of curves for a number of species. It's starting to get overwhelming. We would benefit from a way to examine the performance of the markers in a lower dimensional space. Can we do this, or do we need to keep looking at the full curves?

Finding the most informative dimensions in a data set: principal component analysis

Let's play a little game to illustrate how we can pay attention to the most informative dimensions. Suppose we had a data set like the following (first download the attached rot3d.m function and put it in +dasw/+math):

g = dasw.math.rot3d(30*pi/180,1)*dasw.math.rot3d(45*pi/180,2)*...

[10 0 0 ; 0 4 0 ; 0 0 0.3] * randn(200,3)';

figure;

plot3(g(1,:),g(2,:),g(3,:),'bo')

view(48,38);

Q4: Using the Rotate 3D tool in the figure toolbar, rotate the view; is there a view where the data looks really "thin", meaning there isn't much variation in that dimension? In 3D, does the data look like a "cloud", with volume in all dimensions, or a "disk", lacking volume in one particular dimension?

As you probably saw in Q4, although the data is 3-dimensional, most of the variation in the data is present in only 2 dimensions. It would be great to find a linear transformation that could transform the data so that the dimensions that were most informative (that is, the dimensions that varied the most) could be represented first. That way, we could focus our attention on these interesting dimensions and, if necessary, leave out the least informative dimensions.

Principal component analysis

There is an algorithm for determining such a linear transformation called principal component analysis (PCA). To perform principal component analysis, we compute the covariance matrix (which we first saw in Lab 2.1). This makes sense; the only data needed to determine a projection onto the most informative dimensions are the variance of each dimension and the covariance of each dimension with the other dimensions.

Computing the principal components (PCs)

V = cov(g') % calculate covariance matrix

% I transposed the data in g so each row would have 1 data point

% the answer V should have dimension 3x3

[comps,factor] = pcacov(V)
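
As an optional sanity check, pcacov is essentially an eigendecomposition of the covariance matrix: the columns of comps are eigenvectors of V (possibly up to a sign flip), and factor holds the corresponding eigenvalues sorted from largest to smallest. You can verify this with eig:

[evecs,evals] = eig(V);                        % eigenvectors and eigenvalues of V
[evals_sorted,order] = sort(diag(evals),'descend');
evecs_sorted = evecs(:,order);
evals_sorted,                                  % should match factor
evecs_sorted,                                  % should match comps up to column signs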

The matrix comps is a linear transformation that we can use to project our data onto the most informative dimensions. We'll look at how to do this in just a minute. But at this point, we can already see how much of the total variance in the data would be explained by each dimension if we were to project it onto the new space. Let's look at the variance explained by each dimension.

The variable factor describes how much variance each component explains; the total variance of the data set is given by the sum of the variances of all components, so we can calculate the percentage of the total variance that each component "explains":

totalvariance = sum(factor)

varexplained = 100 * factor ./ totalvariance

figure;

subplot(2,1,1);

bar(varexplained);

ylabel('Percentage variance explained');

xlabel('Dimension number');

box off;

axis([0 3.5 0 100]);

subplot(2,1,2);

bar(cumsum(varexplained));

ylabel('Percentage total variance explained using components 1-N');

xlabel('Dimension number');

box off;

axis([0 3.5 0 100]);

Q5: How much of the total variance in the data is explained by the first component? The second? The third? The first 2 together?

The fact that most of the variance is explained by just a couple components means we will learn a lot about the underlying structure of the data by projecting it onto fewer dimensions.

Projecting the data onto the principal component space

Now we can actually create a graphical projection of the data. To do this, we first have to subtract the mean of the data. Because PCA performs its analysis on the covariance matrix, the projection that is developed is with respect to the mean of all of the data. Then we can perform our linear transformation as normal:

g_submean = g - repmat(mean(g,2),1,size(g,2));

g_proj = comps' * g_submean; % perform the projection

figure;

% plot only the first 2 dimensions

plot(g_proj(1,:),g_proj(2,:),'bo');

xlabel('Principal component 1');

ylabel('Principal component 2');

Q6: Does the 2-dimensional principal component projection resemble the "best" view through the data in 3 dimensions (by "best", I mean the view where you could see the spread of the points the best)?

The projection onto principal components (PCs) is really the projection onto the directions of maximum variance in the space of the data. So it is the view where you can see the spread of the data the best.
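
One quick way to convince yourself of this (optional): the variance of the projected data along each new dimension should match the factor output of pcacov:

var(g_proj,0,2),   % variance along each projected dimension (rows of g_proj)
factor,            % the variances reported by pcacov; should agree up to round-off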

This principle of dimensionality reduction, of paying attention to the dimensions that matter the most and dropping dimensions that matter less, is the basis of "lossy" compression algorithms such as JPEG compression. You can read more about JPEG compression here.

Projecting tuberculosis data onto PCs

Now we can write a function to perform the principal component projection on the melt curve data. We're primarily interested in understanding the structure underlying the differences among the means of the data points rather than the scatter of individual points, so we will first compute the mean melt curve shape for each bacterial species. Then, we can project our points onto the space. Skye and I have written a library function that does this, and we'll call it here:

figure(106);

[h,pc,v,themean] = plot_standard_pca(normstandards3);

Leave this figure open for later.

In this function, the v output argument returns the variance that is explained by each component. Let's look at v:

v,

Wait, you say, why are some of the entries complex? I see that the biggest entries are at the top of the list, which is what we expect (we expect the components that explain the most variance to be listed first), but how am I supposed to calculate the total variance when some of the variances are apparently complex and have sqrt(-1) in them?

The short answer is that the dimensionality of the data (64) is larger than the number of mean values being averaged (25), so the calculation of the PCs is poorly determined. This situation can also arise if some of the dimensions are completely determined by other dimensions. These complex values are quite small, so we can safely ignore them by considering only the "real" part of the data (using the function real).

var_explained = 100* real(v) / sum(real(v));

figure;

subplot(2,1,1);

bar(var_explained);

ylabel('Percentage variance explained');

xlabel('Dimension number');

box off;

axis([0 65 0 100]);

subplot(2,1,2);

bar(cumsum(var_explained));

ylabel('Percentage total variance explained using components 1-N');

xlabel('Dimension number');

box off;

axis([0 65 0 100]);

Q7: How much of the variance do the first 1-5 principal components explain? Is it easier to see the differences among the samples in the projection onto the first 2 principal components, or is it easier to see these differences in the melt curve plots in figure 105 above?

Skye has also written a version that will project the data (without labels) onto the first N principal components:

figure;

plot_standard_pcaN(normstandards3,4);

Our mission: identify the mycobacterium infecting our patient, and estimate the likelihood of correctly identifying the strain

One of the missions of this project is to take a new, unknown sample from the field and say whether it is or is not M. tuberculosis. To simulate how well we can do this, let's calculate the mean and standard deviation of the M. tuberculosis tubes:

TB_tubes = [23 24];

TB_data = [];

for i=1:length(TB_tubes),

for j=1:size(normstandards3(TB_tubes(i)).data,2),

TB_data = [TB_data; normstandards3(TB_tubes(i)).data{j}(2,:)];

end;

end;

TB_data_mean = mean(TB_data);

TB_data_std = std(TB_data);

Now let's generate 100 randomized samples of M. tuberculosis; we'll add random noise to the mean M. tuberculosis curve, with a standard deviation equal to the standard deviation we measured:

TB_simulated = [];

for k=1:100,

TB_simulated = [TB_simulated; TB_data_mean + ...

randn(1,64) .* TB_data_std ];

end;

Skye has written a function that takes each sample and computes the squared error between that sample and the mean curve for every bacterial species in her database. The program then classifies the strain of the bacterium using the standards. We can then report the fraction of the samples that were properly identified:

mycomparisonstandard = normstandards3(23);

for i=1:100,

mycomparisonstandard.data{i} = [normstandards3(23).data{1}(1,:) ; ...

TB_simulated(i,:)];

end;

actvsclass=check_classifier_name(normstandards3,...

mycomparisonstandard,'classify_melt_curve');

actvsclass,

Q8: Look at the 2 columns. The left column shows the species that is to be identified, and the right column shows the best match. Do they always agree?
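
Conceptually, the classification step is just a "nearest mean" rule: compute the squared error between a sample and each species' mean curve, and pick the species with the smallest error. Here is a minimal sketch of that idea for one simulated sample (this is not the library's classify_melt_curve code; it assumes all of the curves in normstandards3 share the same temperature points, which is true for this data set):

sample = TB_simulated(1,:);                   % one simulated M. tuberculosis curve
sse = zeros(1,numel(normstandards3));
for k=1:numel(normstandards3),
    curves = [];
    for j=1:numel(normstandards3(k).data),
        curves = [curves; normstandards3(k).data{j}(2,:)];
    end;
    mean_curve = mean(curves,1);              % mean curve for this standard
    sse(k) = sum((sample-mean_curve).^2);     % squared error to the sample
end;
[dummy,best] = min(sse);
normstandards3(best).strain,                  % the best-matching species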

We really need to examine 3 things: the likelihood that we have an M. tuberculosis sample and report that it is M. tuberculosis, the likelihood that we have an M. tuberculosis sample and report that it is NOT M. tuberculosis (a false negative), and the likelihood that we have a different sample and wrongly report it as M. tuberculosis (a false positive).
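
To make these three quantities concrete, here is a tiny made-up example (the labels below are invented purely for illustration and are not from the data; in the lab you will read the real numbers off of the outputs of the function below):

% made-up labels, purely to illustrate the definitions:
true_strain   = {'M. tuberculosis','M. tuberculosis','M. bovis','M. africanum'};
called_strain = {'M. tuberculosis','M. bovis','M. bovis','M. tuberculosis'};
is_tb   = strcmp(true_strain,'M. tuberculosis');     % samples that really are TB
call_tb = strcmp(called_strain,'M. tuberculosis');   % samples that were called TB
hit_rate            = sum( is_tb &  call_tb) / sum( is_tb)  % TB reported as TB
false_negative_rate = sum( is_tb & ~call_tb) / sum( is_tb)  % TB missed
false_positive_rate = sum(~is_tb &  call_tb) / sum(~is_tb)  % non-TB reported as TB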

The following function generates simulated samples for each strain, and performs smallest squared error classification on all of these samples:

[actvsclass,errormatrix,noisystandards]=noise_analysis3(normstandards3,50, 0);

We can project these simulated data points onto the principal component space. Here, we'll just plot some of the simulated M. tuberculosis points onto the PC space:

figure(106);

plot_standard_projection(noisystandards(23),pc,themean);

Q9: What is the likelihood that the sample is M. tuberculosis and we report it properly as M. tuberculosis? What is the likelihood that the sample is M. tuberculosis and we mis-report it? What is the likelihood that we have something else and we report M. tuberculosis? Does the cloud of simulated TB points in the principal component space appear to be well separated from the other clusters?

Skye has a new set of markers that separate even more of the bacterial species; work continues to make the system more robust to noise and other types of experimental error that we haven't dealt with here.

Function reference

Matlab functions and operators

User functions

  • dasw.stats.kmeans_over_time.m

  • [Skye's library]