Lab 5.1 High-dimensional space: data points as vectors

Introduction

Over the past 2 decades, new biological tools and additional computing power have allowed scientist to collect and examine more data than ever before. A scientist in Eve Marder's lab here at Brandeis created 20 million different models of a 3-cell pacemaker circuit and discovered that a variety of combinations of cell intrinsic properties and synaptic properties can give rise to circuits that perform the same action (Prinz and Marder, 2004). Scientists in Sacha Nelson's lab used microarrays to examine the RNA expression patterns of 12,556 genes in 12 different cell types in the brain in order to understand which patterns of genes are expressed in different cell types (Sugino et al. 2006). And scientists in Larry Wangh's lab are using innovative polymerase chain reaction (PCR) techniques and pattern recognition methods to identify which DNA sequences are present in diseases such as tuberculosis and cancer.

In this unit we will study vector and matrix representations of data. We have been doing this all along, but in this lab we will look specifically at how we can take advantage of this structure to examine our data. By paying attention to the structure of our data and arranging it in certain ways, we will be able to apply some tools from linear algebra to apply transformations such as scalings, rotations, reflections, and other interesting transforms. Further, these tools will allow us to analyze very high-dimensional data sets (in Lab 5.2) to pull out the most informative dimensions.

Vector and matrix representations

Let's think a bit more actively about data points as vectors or matrix values. Let's consider the following set of ordered pairs (you can cut and paste):

v=[ 0.45 1.0;

0.65 1.0;

0.65 0.8;

0.6 0.8;

0.6 0.75;

0.85 0.75;

1.05 1;

1.10 0.95;

0.9 0.65;

0.75 0.65;

0.75 0.1;

0.35 0.1;

0.35 0.65;

0.25 0.3;

0.25 0.1;

0.2 0.1;

0.2 0.35;

0.3 0.75;

0.5 0.75;

0.5 0.8;

0.45 0.8;

0.45 1.0 ];

Here, each row is a different ordered pair with 2 dimensions. We can plot these points:

figure;

plot(v(:,1),v(:,2),'b-o');

Rearranging dimensions

We could also represent the same data in column form, using 1 data point per column, using the transpose operator (a single apostrophe '):

vt = v'

We can also re-arrange dimensions when the dimension of the matrix is larger than 2 using the permute function, which is the N-dimensional generalization of the transpose operation:

v = permute(v,[1 2]) % does nothing

vt = permute(v,[2 1]) % does transpose

a = rand(1,2,3) % makes a 3 dimensional matrix, 1x2x3 with random elements

at = permute(a,[3 2 1]) % re-arranges order to be 3 x 2 x 1

Performing common operations across a dimension

Most of the common Matlab routines, such as mean, median, or sum, allow you to specify the dimension along which the computation is to be performed. For example:

mean(v) % computes across the 1st dimension; that is, across columns

mean(v,1) % tells mean to use 1st dimension, same as mean(v)

mean(v,2) % tells mean to use 2nd dimension

mean(vt,1)

mean(vt,2)

sum(v)

sum(v,1)

sum(v,2)

Q1: How does mean(v,2) differ in dimension from mean(v,1)? How does mean(vt,1) differ in arrangement from mean(v,2)?

Linear transformations and Affine transformations

When we have our data in the form of a vector or a matrix, it is very easy to perform linear transformations or affine transformations on the data. A linear transformation (also known as a linear projection or a linear mapping) is a change-of-coordinates with a linear function. Suppose we have a set of ordered pairs (xi, yi). We can perform a linear transformation by using the equations:

xi* = a*xi + b*yi

yi* = c*xi + d*yi

The act of performing the transform is usually noted mathematically as T(x,y).

A linear transformation has to obey the following rules:

T(x1+x2,y1+y2) = T(x1,y1) + T(x2,y2) (additivity)

T(a*x1, a*y1) = a*T(x1,y1) (homogeneity)

An affine transformation is related to a linear transformation except that translations are allowed before or after the linear transformation:

xi* = a*xi + b*yi + e

yi* = c*xi + d*yi + f

When we write our data points in the form of column vectors, so that each column corresponds to a new data point (as the variable vt above), then we can express linear transformations very easily in the form of matrix multiplication:

A = [ a b ; c d]

then by matrix multiplication

A * [ x1 ; y1] = [ a*xi + b*yi; c*xi + d*yi]

Here we'll examine several common and useful affine and linear transformations:

Translation

The first common operation we often want to perform is to translate (or shift) all of our points by a particular value. If we are lucky enough that we want to shift all dimensions in the same way, then we can simply write:

vt_shift = vt + 5;

figure(100);

plot(vt(1,:),vt(2,:),'b-o');

hold on;

plot(vt_shift(1,:),vt_shift(2,:),'r-x');

title('Original in blue, change in red');

(Leave this figure open.)

But often we are not so lucky. We might want to shift each dimension in a different way. In this case, we have to build our own matrix to subtract using the repmat command that we first saw in Lab 2.3.

Suppose we want to shift the points by 5 in x, and -5 in y. Then we can create a matrix that is the same size as vt:

shift = repmat([5; -5], 1, size(vt,2)) % 1 row, same number of columns as vt

vt_shift2 = vt + shift;

figure(100);

hold on;

plot(vt_shift2(1,:),vt_shift2(2,:),'k-^');

title('Original in blue, change1 in red, change2 in black');

(Again, leave the figure open.)

One common translation we want to make with experimental data is to remove the mean. The usefulness of this will be apparent very soon, when we look at scaling and rotation:

vt_submean = vt - repmat(mean(vt,2),1,size(vt,2));

figure(100);

hold on;

plot(vt_submean(1,:),vt_submean(2,:),'m-d');

How the heck does one remember which is rows and columns?

Okay, you ask, I see that repmat can be used to create a matrix that is the same size as my data matrix, but I know I'm never going to remember which way the rows and columns go. For example, I'm sure I'll accidentally type

shift = repmat([5; -5], size(vt,1), 1)

which will make my matrix be the wrong thing (try it).

How do I remember?

The answer is I don't. Instead, every time I use repmat (and just now as I was preparing this exercise), I write a little code on the command line to check which way it goes. If you get fast at writing a little code to check which way it goes, you'll be accurate and you don't have to remember.

Here's what I did for this exercise.

I knew my shift should be a single column, because the data in vt is organized in columns. So my column shifts are

[5; -5]

Now how do I make sure it has 1 row and N columns, where N is the number of columns of vt? I checked which way the size function went first:

size(vt)

Then I said "Oh, that's right, size(vt,1) is the number of rows, and size(vt,2) is the number of columns. I never remember which way it goes but I'll remember for the next 5 minutes."

Then I wrote a little code both ways, and looked at the answer:

repmat([5;-5],size(vt,1),size(vt,2)) % nope, duh clearly wrong

repmat([5;-5],1,size(vt,2)) %yes, this is the same size as vt

And that's how I figure out which is rows and which is columns. Maybe some people out there have it memorized, I certainly don't.

The same applies for mean, sum, etc ("do I want mean(v) or mean(v,2)?")

Scaling

Scaling is a very simple linear transformation we often want to apply to the data.

vt_scale = 5*vt;

figure(101);

plot(vt(1,:),vt(2,:),'b-o');

hold on;

plot(vt_scale(1,:),vt_scale(2,:),'r-x');

title('Original in blue, scale1 in red');

This scales all of the points. We can also scale all the points but remove the mean first, so that we scale with respect to the center of the points:

vt_scale2 = 5*vt_submean;

figure(101);

hold on;

plot(vt_scale2(1,:),vt_scale2(2,:),'k-^');

title('Original blue, scale1 red, scale2 black');

This again scales all of the points.

Sometimes we might want to scale the dimensions differently. For example, we might want to scale x by 5 but only scale y by 2. To do this, we can multiply by the matrix [5 0; 0 2]. Think about the matrix multiplication; maybe write it out on a sheet of paper.

vt_scale3 = [5 0; 0 2]*vt_submean;

figure(101);

hold on;

plot(vt_scale3(1,:),vt_scale3(2,:),'m-d');

title('Original blue, scale1 red, scale2 black, scale3 magenta');

Rotation

Rotation by an angle is also a linear transformation. The rotation matrix in 2-D for an angle theta in radians is as follows:

[cos(theta) -sin(theta) ; sin(theta) cos(theta) ]

For this transformation, I have written a function (because I can never remember where the negative sign goes and which ones are cosines and which are signs) (save in a folder in tools called math, be sure to add math to your path):

rot2d.m

function r = rot2d(theta)

% ROT2D(THETA) 2D rotation matrix

%

% R = dasw.math.rot2d(theta)

%

% Returns the 2D rotation matrix:

%

% R = [cos(theta) -sin(theta) ; sin(theta) cos(theta) ]

%

r = [cos(theta) -sin(theta) ; sin(theta) cos(theta) ];

The rotation is around the origin, so there is a big difference whether or not you rotate the points in place, or first translate the points to be centered on the origin:

vt_rot1 = dasw.math.rot2d(45*pi/180)*vt; % 45 degrees, converted to radians

vt_rot2 = dasw.math.rot2d(45*pi/180)*vt_submean;

figure(102);

plot(vt(1,:),vt(2,:),'b-o');

hold on;

plot(vt_rot1(1,:),vt_rot1(2,:),'r-x');

plot(vt_rot2(1,:),vt_rot2(2,:),'k-^');

title('Original blue, rot1 red, rot2 black');

We can create a fun animation to rotate this around the clock.

angles = 0:5:360;

h1 = [];

h2 = [];

for i=1:length(angles),

if ~isempty(h1), delete(h1); end;

if ~isempty(h2), delete(h2); end;

vt_rot1 = dasw.math.rot2d(angles(i)*pi/180)*vt; % degrees, converted to radians

vt_rot2 = dasw.math.rot2d(angles(i)*pi/180)*vt_submean;

figure(103);

plot(vt(1,:),vt(2,:),'b-o');

hold on;

h1 = plot(vt_rot1(1,:),vt_rot1(2,:),'r-x');

h2 = plot(vt_rot2(1,:),vt_rot2(2,:),'k-^');

title('Original blue, rot1 red, rot2 black');

axis([-2 2 -2 2]);

drawnow;

pause(0.1);

end;

Reflection

The last major geometrical linear transformation is reflection about an angle theta:

The matrix for a 2-dimensional reflection about theta is:

[cos(2* theta) sin(2* theta) ; sin(2* theta) -cos(2* theta) ];

Once again, I've written a function for this transformation (save in the folder in tools called math):

refl2d.m

function r = refl2d(theta)

% REFL2D - 2D reflection matrix

%

% R = dasw.math.refl2d(THETA)

%

% Returns R = [cos(2*THETA) sin(2*THETA) ; sin(2*THETA) -cos(2*THETA) ];

r = [cos(2*theta) sin(2*theta) ; sin(2*theta) -cos(2*theta) ];

We can explore this transformation in the same way we did for rotation:

vt_refl1 = dasw.math.refl2d(45*pi/180)*vt; % 45 degrees, converted to radians

vt_refl2 = dasw.math.refl2d(45*pi/180)*vt_submean;

figure(104);

plot(vt(1,:),vt(2,:),'b-o');

hold on;

plot(vt_refl1(1,:),vt_refl1(2,:),'r-x');

plot(vt_refl2(1,:),vt_refl2(2,:),'k-^');

title('Original blue, refl1 red, refl2 black');

And again we can create a fun animation to reflect this around the clock.

angles = 0:5:360;

h1 = [];

h2 = [];

for i=1:length(angles),

if ~isempty(h1), delete(h1); end;

if ~isempty(h2), delete(h2); end;

vt_refl1 = dasw.math.refl2d(angles(i)*pi/180)*vt; % degrees, converted to radians

vt_refl2 = dasw.math.refl2d(angles(i)*pi/180)*vt_submean;

figure(105);

plot(vt(1,:),vt(2,:),'b-o');

hold on;

h1 = plot(vt_refl1(1,:),vt_refl1(2,:),'r-x');

h2 = plot(vt_refl2(1,:),vt_refl2(2,:),'k-^');

title('Original blue, refl1 red, refl2 black');

axis([-2 2 -2 2]);

drawnow;

pause(0.1);

end;

Compositions of transformations

A beautiful thing about linear transformations is that they can be composed super easily: it's just matrix multiplication. Here's an example of a scaling, rotation, and reflection in 1 line:

vt_trans = [5 0 ; 0 2] * dasw.math.rot2d(45*pi/180) * dasw.math.refl2d(0) * vt_submean

figure(106);

plot(vt(1,:),vt(2,:),'b-o');

hold on;

plot(vt_trans(1,:),vt_trans(2,:),'r-x');

title('Original blue, trans red');

axis equal % use same size of x and y axis

Now that we are armed with these linear and affine transformations (which have many many applications) we will be equipped for dimensionality reduction in Lab 5.2.

Let's now turn to looking at higher dimensional spaces (that is, when the dimension is greater than 2).

Plotting in higher dimensional spaces

It is quite common in science to generate data that has 3 or dimensions, but making sense out of such data and showing the results to others presents special challenges.

Continuously sampled data data (such as time series); surfaces

Consider the zebra finch song that we analyzed in Lab 3.2 (attached below). The spectrogram of the zebra finch song is essentially a 3-d piece of information; at each point in time (1), we have a power/intensity value (2) at each frequency (3).

Recall:

[wave,fs] = audioread('song.wav');

figure;

spectrogram(wave,256,[],[],fs,'yaxis');

(Leave this figure open.) The spectrogram method demonstrates a powerful way to display 3d information by essentially reducing it down to 2 dimensions in the form of an image (using color to code for power). One could look at a spectrogram in a paper and interpret it easily.

The spectrogram actually function plots a surface, which is a 3-dimensional structure. To illustrate how to do this in Matlab, let's provide output arguments to the spectrogram function, so we can get the data and plot it ourselves using the same functions that spectrogram calls.

[s,f,t,p] = spectrogram(wave,256,[],[],fs,'yaxis');

size(f) % the frequency dimension

size(t) % the time dimension

size(p) % the power value for each time, frequency

The Matlab function that plots surfaces is called surf (see help surf).

figure;

surf(t,f,10*log10(abs(p)),'EdgeColor','none');

axis xy; axis tight; colormap(jet);

xlabel('Time');

ylabel('Frequency (Hz)');

Q2: Look at the 3D view. You can use the "Rotate 3-D" tool (the box with the circular arrow around it). Which view do you prefer, the 2-D view or the 3-D view? If you were going to put an image in a paper, which do you think would be easier for readers to digest?

It is possible to adjust the angle of view on the command line using the view command (see help view):

view(0,90); % changes to 2d perspective

view(-37.5, 30); % changes to default 3-d view

There are a lot of options in Matlab for viewing data in 3-D. A 3-d surface plot can be combined with a contour plot:

figure;

surfc(t,f,10*log10(abs(p)),'EdgeColor','none');

axis xy; axis tight; colormap(jet);

xlabel('Time');

ylabel('Frequency (Hz)');

or to use another bit of example data (derived from the peaks function in Matlab, which is an example surface for demoing the suface commands):

figure;

[X,Y,Z] = peaks(30);

surfc(X,Y,Z)

colormap hsv

axis([-3 3 -3 3 -10 5])

High-dimensional data points

Finally, often we have individual data points with many parameters. As an example, there is a famous set of data of 3 different Iris species (Iris setosa, Iris versicolor, and Iris virginica) that was collected by Edgar Anderson and made famous by RA Fisher, who used the data to demonstrate linear discriminate analysis. (Check the link to see pictures of the 3 flower species.) Anderson measured 4 parameters of several flowers of each species: sepal length, sepal width, petal length, and petal width. Therefore, each flower is a 4-dimensional data point.

iris_setosa = load('iris_setosa.txt','-ascii')

iris_versicolor = load('iris_versicolor.txt','-ascii')

iris_virginica = load('iris_virginica.txt','-ascii')

How should we plot this data?

We could use the built-in function plot3d (see help plot3d) and examine 3-dimensions at a time.

figure;

plot3(iris_setosa(:,1),iris_setosa(:,2),iris_setosa(:,3),'bo');

hold on;

plot3(iris_versicolor(:,1), iris_versicolor(:,2), iris_versicolor(:,3),'rx');

plot3(iris_virginica(:,1), iris_virginica(:,2), iris_virginica(:,3),'k^');

xlabel('Sepal length');

ylabel('Sepal width');

zlabel('Petal length');

Q3: How do these 3 species differ? (You might need to use the Rotate 3-D tool to appreciate the different parameters.)

But, you complain, this is lame, we are leaving out one of the dimensions of the data. How can we represent all data points at once?

High-dimensional scatterplot

We can do this with a multidimensional scatter plot of the data. We first encountered scatter plots of 2-dimensions in Lab 2.1. Recall we can plot

figure;

plot(iris_setosa(:,1),iris_setosa(:,2),'bo');

hold on;

plot(iris_versicolor(:,1), iris_versicolor(:,2),'rx');

plot(iris_virginica(:,1), iris_virginica(:,2),'k^');

xlabel('Sepal length');

ylabel('Sepal width');

To perform an N dimensional scatterplot in a Matlab figure, I use a routine written by a former colleague Maneesh Sahani called scatterplot.m. It relies on another function called assign.m. Both are attached at the end of this lab. Place scatterplot.m in the +dasw/+plot package inside your tools folder.

help dasw.plot.scatterplot % spend a minute reading the documentation

% label groups, just like for ANOVA

G = [repmat(1,size(iris_setosa,1),1);

repmat(2,size(iris_versicolor,1),1);

repmat(3,size(iris_virginica,1),1);]

data = [iris_setosa;iris_versicolor;iris_virginica];

figure;

dasw.plot.scatterplot(data,'class',G,'markersize',8);

legend('Setosa','Versacolor','Virginica','Location','NorthOutside')

Take your mouse and move the legend out of the way so all the plots can be full size.

The scatterplot plots every dimension against every other dimension. The upper left graph has dimension 2 (sepal width, on the y axis) plotted against dimension 1 (sepal length, on the x axis). If you look down the left hand column of graphs, you see that the dimension 1 (sepal length) is consistent across the row, allowing you to see how the other dimensions (2, 3, and 4) co-vary with dimension 1. In the second column, dimension 2 is on the x axis, and dimensions 3 and 4 are on the y axis. By comparing the points across the graph rows and columns, you can get a sense of how the data looks in multi-dimensional space.

Q4: Describe how iris setosa differs from iris versicolor and iris virginica in all 4 dimensions. How do iris versicolor and iris virginica differ (in all 4 dimensions)?

Function reference

Matlab functions and operators

User functions

  • dasw.math.rot2d.m

  • dasw.math.refl2d.m

  • dasw.plot.scatterplot.m

  • assign.m