Lab 2.1 Describing data with fits (models)
Introduction
Up to now, we have examined experimental data that is derived from simple experiments with 2 or more groups of data. Each data point has been a single value, such as a height, or a blood-glucose concentration.
In other experiments, we want to explore the relationship between 2 quantities. This type of analysis is sometimes called phenomenological modeling, or, when we imagine that a relationship is linear, regression. The procedure used to find the best model between 2 or more quantities is called fitting.
The process of developing phenomenological models (or "fits") can address 1 or more of several questions:
Are the values of 2 (or more) variables related (known as correlated)?
How much does knowing the value of 1 variable tell you about the value of other variables? This is useful both for uncovering the phenomenological relationships underlying correlations and for prediction and estimation when 1 variable might be unknown.
When 1 variable is varied in an independent fashion, it is sometimes possible to uncover the functional rules underlying the relationship between the independent variable and other variables.
Scatterplots
The primary plot type that one can use to examine relationships between 2 (or more) variables is the scatterplot. (We'll look at 3-dimensional and higher spaces in unit 5; for now we'll stick to 2-D.) Let's look at some scatterplots:
Here, from Wikipedia::Scatterplot, we see the relationship between eruption intervals and eruption durations of Old Faithful, a geyser at Yellowstone National Park.
Let's make our own scatterplot with some data of manatee deaths by powerboat accident and the number of powerboat registrations in the state of Florida (download the file manatee_powerboat.txt below). This dataset is from Baldi, and has 3 columns: the year, powerboat registrations in thousands, and manatee deaths.
florida = load('manatee_powerboat.txt','-ascii');
florida, % see the data
figure;
plot(florida(:,2),florida(:,3),'ko'); % plot with black circles
xlabel('Powerboats registered (thousands)');
ylabel('Manatee deaths from powerboat collision');
box off; % turn off the ugly extra box
Q1: In a particular year, there were about 755,000 registered powerboats. By reading the scatterplot, how many manatee deaths were there in that same year?
Q2: Is there a relationship between powerboat registrations and manatee deaths over the years reported here? What is it?
Q3: Does this relationship indicate that powerboat registrations cause manatee deaths, or merely that they are correlated?
Covariance
One question we often want to answer is the following: "how much does knowing the value of 1 variable tell me about the value of the other variable?" For example, how much does knowing the number of powerboat registrations tell us about the number of manatee deaths?
Recall that in lab 1.8 we calculated how much of the variance in a set of data could be explained by group membership (the example we used was how much of the variance in heights could be explained by sex). There is an analogous calculation for determining how much of the total variance of 2 variables is explained by knowing one of the values.
To calculate this, we need to define a new quantity called the covariance, which describes how the deviations of the 2 variables from their respective means vary together. The following is the best estimator of the "true" covariance between 2 sampled variables x and y (with each sample point represented by xi, yi):

Cov(x,y) = [ Σi (xi - mean(x)) * (yi - mean(y)) ] / (n - 1)

(Just like for variance (defined in Lab 1.7), there is a "sample covariance" that is identical except that one divides by n instead of n-1.)
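To make the formula concrete, here is a quick sketch (not part of the original lab code; the vectors xdemo and ydemo are made up for illustration) that computes the covariance directly from the definition and compares it to Matlab's built-in cov:
xdemo = [1 2 3 4 5]'; % a small made-up data set
ydemo = [2 1 4 3 6]';
n = length(xdemo);
cov_byhand = sum((xdemo-mean(xdemo)).*(ydemo-mean(ydemo)))/(n-1) % the formula above
cov_builtin = cov(xdemo,ydemo) % the off-diagonal entries should equal cov_byhand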
Let's explore the covariance. Before we analyze real data, let's create some artificial data where we "contrive" the relationship between the variables.
Let's create 3 sets of variables. In the first, y = x; that is, knowing x completely tells you y and vice-versa. In the second, x and y are both random; knowing 1 tells you nothing about the other. In the third, y = -x; knowing y tells you the value of x but you have to multiply by -1 first.
x1 = rand(20,1) % creates an array of random numbers distributed between 0 and 1
y1 = x1; % y1 is x1
x2 = rand(20,1);
y2 = rand(20,1);
x3 = rand(20,1);
y3 = -x3;
Now let's quickly plot these variables on a graph. We'll take advantage of our new knowledge of Matlab graphics handles to put them all on 1 figure.
figure;
subplot(2,2,1);
plot(x1,y1,'ko');
ylabel('y1'); xlabel('x1');
subplot(2,2,2);
plot(x2,y2,'ko');
ylabel('y2'); xlabel('x2');
subplot(2,2,3);
plot(x3,y3,'ko');
ylabel('y3'); xlabel('x3');
Covariance is computed by the function cov in Matlab, but it does something you might not expect. Rather than computing the covariance between 2 variables and returning that, it actually computes the variance or covariance for every combination of the variables you supply:
So try this:
cov(x2,y2)
You see a 2x2 matrix. You'll notice that the diagonal entries of the matrix are just the variances of x2 and y2 alone. Try it:
var(x2)
var(y2)
See?
The off-diagonals are the covariances between x2 and y2.
Q4: Is there a difference between Cov(x2,y2) and Cov(y2,x2)? Did you expect a difference from the equation? The result you find here holds in general.
Now compare the covariance matrices from these different cases:
C1 = cov(x1,y1)
C2 = cov(x2,y2)
C3 = cov(x3,y3)
Q5: What are Cov(x1,y1), Cov(x2,y2), and Cov(x3,y3)?
Let's unpack this equation just a little bit more:
If y and x have anything to do with one another, then when a sample yi has a positive deviation from its mean, the corresponding point xi will also tend to have a positive deviation from its mean, and the quantity Cov(x,y) will tend to be positive. If x and y have nothing to do with one another, then when an xi happens to have a positive deviation from its mean, yi's deviation will be unrelated (either positive or negative), and on average the quantity Cov(x,y) will be close to 0.
Correlation
To answer our question of how much knowing the value of 1 variable tells you about the value of the other variable, we need to normalize the covariance by the total amount of variance there possibly could be to explain. This factor is called the correlation coefficient r (sometimes called Pearson's r after its inventor), and the equation is:

r = Cov(x,y) / sqrt( Var(x) * Var(y) )
Let's directly calculate the correlation coefficient from the covariance matrices above:
r_1 = C1(1,2)/sqrt(C1(1,1)*C1(2,2))
r_2 = C2(1,2)/sqrt(C2(1,1)*C2(2,2))
r_3 = C3(1,2)/sqrt(C3(1,1)*C3(2,2))
In general, the correlation coefficient r varies between -1 and 1, and its square indicates the fraction of the total variance in 1 variable that is explained by knowledge of the other variable.
If you aren't going to compute the covariance matrix for other reasons, and you only want to know the r value, then you can obtain that value directly with the function corrcoef, which also returns a matrix:
Let's try it:
corrcoef(x1,y1)
corrcoef(x2,y2)
corrcoef(x3,y3)
corrcoef(florida(:,2),florida(:,3))
Q6: What is the correlation coefficient r between the number of powerboat registrations and manatee deaths?
Significance of correlation:
It is also possible to put a P-value on the significance of the correlation. It is possible for 2 variables to be correlated just by chance, particularly if the number of samples is low. We won't go over the math here, but the P-value for the significance of the correlation is returned in the second output argument of the function corrcoef:
[R,p] = corrcoef(florida(:,2),florida(:,3))
Q7: Can you reject, with an alpha of 0.05, that the correlation that you observed between manatee deaths by powerboats and the number of powerboat registrations is simply due to sampling variations? That is, can you say with 95% confidence, that this correlation is likely to be significantly greater than 0?
Correlation and causation
If 2 variables are correlated, does it follow directly that they are causally linked? For example, it is true that the number of violent deaths in a city is highly correlated with the number of pharmacists in a city. Does this imply a causal link? (That is, are the pharmacists causing the violent deaths?)
Q8: Describe a scenario where the manatee deaths might not be caused by the number of powerboat registrations. Invent a "hidden variable", not measured in the graph, that might explain the correlation without causation.
Linear Fits, aka "regression"
The most elementary way we can relate one variable to another in a phenomenological model is by a linear fit. Recall that a line has the equation: y = m*x + b, where m is the slope, and b is the y-intercept (that is, the value of y when x is 0).
Why is linear analysis called regression?
From the Baldi and Moore text:
To "regress" means to go backward. Why are statistical methods for predicting a response from an explanatory variable called "regression"? Sir Francis Galton (1822-1911), who was the first to apply regression to biological and psychological data, looked at examples such as the heights of of their parents. He found that taller-than-average parents tended to have children who were also taller than average but not as tall as their parents. Galton called this fact "regression towards the mean" and the name came to be applied to the statistical method.
So, from what I gather, it's called regression because one particular data set that was studied many years ago regressed to the overall population mean. That's a pretty lame name; not only does it not describe what the technique is, it asserts a proposition (that things "regress" to the mean) that isn't generally true. The technique should probably be called "linear correlation analysis" or something like that. But most software packages call this regression, so we have to learn this very irritating (to me!) term.
When choosing the "best-fit" line, we choose the line that minimizes the squared error between the model and the data. For the special case of linear fits, this equation has been worked out explicitly, and it's quite beautiful:
y = m*x + b, such that

m = r * sqrt( Var(y) / Var(x) )

and

b = mean(y) - m*mean(x).
Let's try it for our Florida data. First we will replot:
x = florida(:,2);
y = florida(:,3);
figure;
plot(x,y, 'bo');
xlabel('Powerboats registered (thousands)');
ylabel('Manatee deaths from powerboat collision');
box off;
[R,p] = corrcoef(x,y)
C = cov(x,y);
m = R(1,2)*sqrt(C(2,2)/C(1,1))
b = mean(y) - m*mean(x)
X = [ 400 1100]; % chosen to match our X-axis of interest
Y = m * X + b
hold on
plot(X,Y,'k--');
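As a cross-check (this step is not part of the original lab), Matlab's built-in polyfit will find the least-squares line directly; its slope and intercept should agree with the m and b we computed above:
p = polyfit(x,y,1); % fit a 1st-degree polynomial (a line) by least squares
p(1), m, % the slope from polyfit and our m should agree
p(2), b, % the intercept from polyfit and our b should agree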
Let's calculate the error between "the fit" and the data. Let's denote the fit by a function F(x) that can be evaluated at any point x. Then, we have the raw x_i values (in this case, the powerboats registered), the raw y_i values (manatee deaths here), and the values of y that would be predicted by the fit (the F(x_i) values). To see how good a job our fit did in describing the data, we can calculate the "mean squared error" between the y_i values and the F(x_i) values:

mean squared error = [ Σi (y_i - F(x_i))^2 ] / n

which in our case is:
Y_fit = m * x + b;
error = y-Y_fit
We'd like to calculate the squared error on an element-by-element basis. To do this, we can use the . operator (pronounced "dot", but not the same as the function called dot) to perform element-by-element arithmetic. To use the . operator, write the . followed by the operation you want to perform element by element, such as multiplication:
[1 2 3].*[4 5 6]
Let's apply this to the squared error:
squarederror = error.*error;% OR, error.^2 is el-by-el raising to 2nd power
meanerr = mean(squarederror)
Now let's try a slightly different function:
m1 = m + 0.05; b1 = b + 1;
Y_fit1 = m1 * x + b1;
meanerr2 = mean((y-Y_fit1).^2);
Let's plot this less good function, too:
plot([400 1100],[400 1100]*m1+b1,'g--');
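To compare the two fits numerically (a quick check that is not in the original code), display both mean squared errors side by side; the second, perturbed line should have the larger error:
[meanerr meanerr2] % the second number should be larger, since (m1,b1) is a worse fit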
Why do we minimize the squared error between the data and the fit?
The reason is that minimizing the squared error gives the best estimate of the "mean" behavior of the data. There is a mathematical proof of this fact that we won't go over here (but check it out for the case of linear regression here). If you minimize the absolute value of the difference between the data and the fit, then you will get the best estimate of the median of the data. Traditionally, however, most investigators minimize the squared error.
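Here is a small sketch of that claim (not from the original lab; the variables data and guesses are made up for illustration). For a single list of numbers, the constant value that minimizes the summed squared error is the mean, while the constant value that minimizes the summed absolute error is the median:
data = [1 2 3 4 100]'; % a skewed example so the mean and median differ a lot
guesses = 0:0.1:100; % candidate constant "fits"
sqerr = arrayfun(@(g) sum((data-g).^2), guesses); % squared error for each guess
abserr = arrayfun(@(g) sum(abs(data-g)), guesses); % absolute error for each guess
[~,i_sq] = min(sqerr); [~,i_abs] = min(abserr);
[guesses(i_sq) mean(data)] % the squared-error minimizer sits at the mean
[guesses(i_abs) median(data)] % the absolute-error minimizer sits at the median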
Non-linear models
Matlab also has powerful tools for performing non-linear fitting. We will briefly start non-linear fitting here and we will dig much deeper into this in the next lab. But for now let's get warmed up.
Consider the dataset census (built-in to Matlab), which shows the population of the United States (in millions) as a function of year. (This is modified from some instructional material from the Mathworks.)
load census % loads variables 'cdate' and 'pop'
figure;
plot(cdate,pop,'o')
ylabel('Population');
xlabel('Census year');
hold on;
Suppose we wanted to fit this to a quadratic curve: f(x) = a*(x-b)^2. That is, we want to find the best values of a and b that minimize the squared error between the fit and the data.
In non-linear fitting, we can't assume that anyone has solved for a and b analytically; usually, this is not the case. Instead, we want the computer to search for the best numbers by trying lots of combinations and finding the best fit. There are important limits to one's ability to do this, and we will focus on these limits next time.
We will call a series of functions to define our function and the search parameters. The first is fitoptions, which allows us to specify the method and lower and upper bounds for a and b, and where the algorithm should start searching (we'll specify a=1, b=1):
s = fitoptions('Method','NonlinearLeastSquares',... % use squared error
'Lower',[0,0],... % lower bounds for [a b]
'Upper',[Inf,max(cdate)],... % upper bounds for [a b]
'Startpoint',[1 1]); % startpoint, search will start at a=1,b=1
Some notes on fitoptions
Generally, we want Matlab to calculate the squared error between the data and the fit, so we specify the use of NonlinearLeastSquares as the error function (note that this has nothing to do with whether or not the fit itself is linear; rather, it specifies how the error between the data and the fit is calculated).
In this case 'Lower' is a 2-element vector because there are 2 free parameters in our equation (a and b). If we had wanted to allow the search to explore the full range of values from negative Infinity to Infinity for a and 0 to max(cdate) for b, we'd use 'Lower', [-Inf 0], 'Upper', [Inf max(cdate)]. If our equation had more free parameters, we would need to make 'Lower' and 'Upper' have more elements to match. You might wonder how the program knows which parameter goes with which Lower/Upper value: by default, the parameters are taken in alphabetical order (a is first, b is second, and so on). Finally, it is not necessary to specify a Startpoint; if you don't include that option, then the search proceeds from a random starting position.
Next, we'll build our custom fit function using the function fittype:
f = fittype('a*(x-b)^2','options',s);
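If you want to double-check the alphabetical parameter ordering mentioned above (an optional aside), the Curve Fitting Toolbox function coeffnames lists the parameters of a fittype in the order Matlab uses for 'Lower', 'Upper', and 'Startpoint':
coeffnames(f) % should list a first and b second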
Finally, we'll perform the fit with the function aptly named fit:
[c,gof] = fit(cdate,pop,f)
Look at all the beautiful values fit returns in c and gof. It gives 95% confidence intervals on the best values of a and b, and gives you an r2 value to tell you how much of the total variance in the data that is captured by the fit.
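If you'd like to pull those numbers out programmatically (an optional aside, not required for the lab), the Curve Fitting Toolbox provides accessor functions for the fit object c and fields of the gof structure:
coeffvalues(c) % the best-fit values of a and b
confint(c) % 95% confidence intervals on a and b
gof.rsquare % the r^2 value reported in gof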
r^2: Coefficient of Determination
But wait, what is the difference between r and r2?
r shows the correlation between two variables in a set of observed data. The closer r is to 1, the more correlated the two variables are (remember the powerboat registrations and manatee deaths?) r can also be -1, which indicates a negative correlation (for example, if the number of manatee deaths decreased as the number of powerboat registrations increased, this would be a negative correlation).
r2 is the fraction of total variance that is accounted for by a particular model. The closer r2 is to 1, the better the model fits the observed data. However, unlike r, a value less than 0 for r2 means the model does an even worse job than just stating the mean!
For a linear model, the equations for the fit are already known and r2 really is just r squared. This is not true for non-linear models, but we still call the quantity r2. r2 is normally defined as
r2 = 1 - (Sum of Squared Errors Between the Model and the Data)/(Sum of Squared Deviations of the Data Around its Mean)
We can write down the definition of r2 using the notation above:

r2 = 1 - [ Σi (y_i - F(x_i))^2 ] / [ Σi (y_i - mean(y))^2 ]
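As a sanity check (not part of the original lab), you can compute r2 for the census fit directly from this definition and compare it to the value Matlab reported in gof:
r2_byhand = 1 - sum((pop - c(cdate)).^2)/sum((pop - mean(pop)).^2)
gof.rsquare % should be (nearly) the same number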
Now we can plot the fit with the raw data:
x_values = 1750:2000;
y_values = c(x_values); % this returns c evaluated with best a, b
hold on;
plot(x_values,y_values,'b--');
Q9: What is the r2 value for the non-linear fit of the census data?
Function reference
Matlab functions and operators
User functions
[no new functions]
Labs are copyright 2011-2012, Stephen D. Van Hooser, all rights reserved.