Introduction

Up to now, we have examined experimental data derived from simple experiments with 2 or more groups, where each data point was a single value, such as a height or a blood-glucose concentration. In other experiments, we want to explore the relationship between 2 quantities. This type of analysis is sometimes called phenomenological modeling or, when we imagine that the relationship is linear, regression. The procedure used to find the best model relating 2 or more quantities is called fitting. The process of developing phenomenological models (or "fits") can address several questions: whether 2 quantities are related, how strong that relationship is, and whether one quantity can be predicted from the other.
Scatterplots

The primary plot type used to examine the relationship between 2 (or more) variables is the scatterplot. (We'll look at 3-dimensional and higher spaces in Unit 5; for now we'll stick to 2-D.) Let's look at some scatterplots. The figure from Wikipedia::Scatterplot shows the relationship between eruption interval and eruption duration for Old Faithful, a geyser at Yellowstone National Park.

Let's make our own scatterplot with some data on manatee deaths by powerboat accident and the number of powerboat registrations in the state of Florida (download the file manatee_powerboat.txt below). This dataset is from Baldi and has 3 columns: the year, powerboat registrations in thousands, and manatee deaths.

florida = load('manatee_powerboat.txt','-ascii');
florida, % see the data
figure;
plot(florida(:,2),florida(:,3),'ko'); % plot with black circles
xlabel('Powerboats registered (thousands)');
ylabel('Manatee deaths from powerboat collision');
box off; % turn off the ugly extra box

Q1: In a particular year, there were about 755,000 registered powerboats. By reading the scatterplot, how many manatee deaths were there in that same year?

Q2: Is there a relationship between powerboat registrations and manatee deaths over the years reported here? What is it?

Q3: Does this relationship indicate that powerboat registrations cause manatee deaths, or merely that they are correlated?

Covariance

One question we often want to answer is the following: how much does knowing the value of 1 variable tell me about the value of the other variable? For example, how much does knowing the number of powerboat registrations tell us about the number of manatee deaths? Recall that in Lab 1.8 we calculated how much of the variance in a set of data could be explained by group membership (the example we used was how much of the variance in heights could be explained by sex). There is an analogous calculation for determining how much of the total variance in 1 of the 2 variables is explained by knowing the value of the other. To perform it, we need to define a new quantity called the covariance, which describes how the deviations of the 2 variables from their respective means vary together. The following is the best estimator of the "true" covariance between samples of 2 variables x and y (with the i-th sample pair written x_i, y_i):

ŝ(x,y) = ( Σ (x_i − x̄)(y_i − ȳ) ) / (n − 1)

where the sum runs over all n sample pairs and x̄ and ȳ are the means of the x and y values. (Just as for the variance, defined in Lab 1.7, there is a "sample covariance" that is identical except that one divides by n instead of n-1.)

Let's explore the covariance. Before we analyze real data, let's create some artificial data where we "contrive" the relationship between the variables. Let's create 3 pairs of variables. In the first, y = x; that is, knowing x completely tells you y and vice-versa. In the second, x and y are both random; knowing 1 tells you nothing about the other. In the third, y = -x; knowing y tells you the value of x, but you have to multiply by -1 first.

x1 = rand(20,1) % creates an array of random numbers distributed between 0 and 1
y1 = x1; % y1 is x1
x2 = rand(20,1);
y2 = rand(20,1);
x3 = rand(20,1);
y3 = -x3;

Now let's quickly plot these variables on a graph. We'll take advantage of our new knowledge of Matlab graphics handles to put them all on 1 figure.
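Before making that figure, here is a quick optional check that is not part of the original lab: you can compute the covariance estimator directly from its definition and compare the result with Matlab's cov function, which we introduce next. (The .* operator used here performs element-by-element multiplication; it is explained more fully later in this lab.)

n = length(x1);                                            % number of sample pairs
covar_by_hand = sum( (x1-mean(x1)) .* (y1-mean(y1)) ) / (n-1)
% this number should match the off-diagonal entry of cov(x1,y1) below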
figure;
subplot(2,2,1);
plot(x1,y1,'ko'); ylabel('y1'); xlabel('x1');
subplot(2,2,2);
plot(x2,y2,'ko'); ylabel('y2'); xlabel('x2');
subplot(2,2,3);
plot(x3,y3,'ko'); ylabel('y3'); xlabel('x3');

Covariance is computed by the function cov in Matlab, but it does something you might not expect. Rather than computing the covariance between 2 variables and returning that single number, it computes the variance or covariance for every combination of the variables you supply. So try this:

cov(x2,y2)

You see a 2x2 matrix. You'll notice that the diagonal entries of the matrix are just the variances of x2 and y2 alone. Try it:

var(x2)
var(y2)

See? The off-diagonal entries are the covariances between x2 and y2.

Q4: Is there a difference between ŝ(x2,y2) and ŝ(y2,x2)? Did you expect a difference from the equation? The result you find here holds in general.

Now compare the covariance matrices from these different cases:

C1 = cov(x1,y1)
C2 = cov(x2,y2)
C3 = cov(x3,y3)

Q5: What are ŝ(x1,y1), ŝ(x2,y2), and ŝ(x3,y3)?

Let's unpack the covariance equation a little more. If y and x have anything to do with one another, then when a sample y_i has a positive deviation from its mean, the corresponding point x_i will also tend to have a positive deviation from its mean, and the quantity ŝ(x,y) will tend to be positive. If x and y have nothing to do with one another, then when an x_i happens to have a positive deviation from its mean, y_i's deviation will be unrelated (either positive or negative), and on average the quantity ŝ(x,y) will be close to 0.

Correlation

To answer our question (how much does knowing the value of 1 variable tell you about the value of the other?), we need to normalize the covariance by the total amount of variance there possibly could be to explain. The normalized quantity is called the correlation coefficient r (sometimes called Pearson's r after its inventor), and the equation is:

r = ŝ(x,y) / sqrt( ŝ(x,x) · ŝ(y,y) )

where ŝ(x,x) and ŝ(y,y) are simply the variances of x and y. Let's directly calculate the correlation coefficient from the covariance matrices above:

r_1 = C1(1,2)/sqrt(C1(1,1)*C1(2,2))
r_2 = C2(1,2)/sqrt(C2(1,1)*C2(2,2))
r_3 = C3(1,2)/sqrt(C3(1,1)*C3(2,2))

In general, the correlation coefficient r varies between -1 and 1, and its square, r², gives the fraction of the total variance in 1 variable that is explained by knowledge of the other variable. If you aren't going to compute the covariance matrix for other reasons and you only want the r value, you can obtain it directly with the function corrcoef, which also returns a matrix. Let's try it:

corrcoef(x1,y1)
corrcoef(x2,y2)
corrcoef(x3,y3)
corrcoef(florida(:,2),florida(:,3))

Q6: What is the correlation coefficient r between the number of powerboat registrations and manatee deaths?

Significance of correlation: It is also possible to put a P value on the significance of the correlation. It is possible for 2 variables to be correlated just by chance, particularly if the number of samples is low. We won't go over the math here, but the P value for the significance of the correlation is returned in the second output argument of the function corrcoef:

[R,p] = corrcoef(florida(:,2),florida(:,3))

Q7: Can you reject, with an alpha of 0.05, the possibility that the correlation you observed between manatee deaths by powerboats and the number of powerboat registrations is simply due to sampling variation? That is, can you say with 95% confidence that this correlation is significantly greater than 0?
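As an aside that is not part of the original lab, here is the math that corrcoef uses behind the scenes, in case you are curious: for n sample pairs, the statistic t = r*sqrt((n-2)/(1-r^2)) follows a Student's t distribution with n-2 degrees of freedom when the true correlation is 0. A minimal sketch, assuming the Statistics Toolbox function tcdf is available:

n = size(florida,1);                 % number of year/registration/death rows
r = R(1,2);                          % correlation from the corrcoef call above
t = r*sqrt((n-2)/(1-r^2));           % t statistic under the null hypothesis of zero correlation
p_by_hand = 2*(1-tcdf(abs(t),n-2))   % two-tailed P value; should match p(1,2) above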
Correlation and causation

If 2 variables are correlated, does it follow directly that they are causally linked? For example, it is true that the number of violent deaths in a city is highly correlated with the number of pharmacists in the city. Does this imply a causal link? (That is, are the pharmacists causing the violent deaths?)

Q8: Describe a scenario where the manatee deaths might not be caused by the number of powerboat registrations. Invent a "hidden variable", not measured in the graph, that might explain the correlation without causation.

Linear Fits, aka "regression"

The most elementary way we can relate one variable to another in a phenomenological model is with a linear fit. Recall that a line has the equation y = m*x + b, where m is the slope and b is the y-intercept (that is, the value of y when x is 0). When choosing the "best-fit" line, we choose the line that minimizes the squared error between the model and the data. For the special case of linear fits, the solution has been worked out explicitly, and it's quite beautiful:

m = r · sqrt( ŝ(y,y) / ŝ(x,x) )    and    b = ȳ − m·x̄

Let's try it for our Florida data. First we will replot:

x = florida(:,2);
y = florida(:,3);
figure;
plot(x,y,'bo');
xlabel('Powerboats registered (thousands)');
ylabel('Manatee deaths from powerboat collision');
box off;
[R,p] = corrcoef(x,y)
C = cov(x,y);
m = R(1,2)*sqrt(C(2,2)/C(1,1))
b = mean(y) - m*mean(x)
X = [ 400 1100]; % chosen to match our X-axis of interest
Y = m * X + b
hold on
plot(X,Y,'k--');

Let's calculate the error between "the fit" and the data. Let's denote the fit by a function F(x) that can be evaluated at any point x. Then we have the raw x_i values (in this case, the powerboats registered), the raw y_i values (manatee deaths here), and the values of y that would be predicted by the fit (the F(x_i) values). To see how good a job our fit did in describing the data, we can calculate the "mean squared error" between the y_i values and the F(x_i) values:

mean squared error = (1/n) · Σ (y_i − F(x_i))²

which in our case is:

Y_fit = m * x + b;
error = y - Y_fit

We'd like to calculate the error squared on an element-by-element basis. To do this, we can use the . (pronounced "dot", but not the same as the function called "dot") operator to perform element-by-element arithmetic. To use the . operator, we write the . followed by the operation we want to perform element by element, such as multiplication:

[1 2 3].*[4 5 6]

Let's apply this to the squared error:

squarederror = error.*error; % OR, error.^2 is element-by-element raising to the 2nd power
meanerr = mean(squarederror)

Now let's try a slightly different line:

m1 = m + 0.05;
b1 = b + 1;
Y_fit1 = m1 * x + b1;
meanerr2 = mean((y-Y_fit1).^2)

Let's plot this less good fit, too:

plot([400 1100],[400 1100]*m1+b1,'g--');

Why do we minimize the squared error between the data and the fit?

The reason is that minimizing the squared error gives the best estimate of the "mean" behavior of the data. There is a mathematical proof of this fact that we won't go over here (but check it out for the case of linear regression here). If you instead minimize the absolute value of the difference between the data and the fit, you get the best estimate of the median of the data. Traditionally, however, most investigators minimize the squared error.

Non-linear models

Matlab also has powerful tools for performing non-linear fitting. We will briefly start non-linear fitting here and dig much deeper into it in the next lab. But for now let's get warmed up. Consider the dataset census (built-in to Matlab), which shows the population of the United States (in millions) as a function of year. (This is modified from some instructional material from the Mathworks.)
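A practical note that is not part of the original lab: the fitoptions, fittype, and fit functions used below come from Matlab's Curve Fitting Toolbox. If that toolbox is not installed, you can still get an approximate answer with base Matlab's fminsearch after loading the data below. The sketch that follows is only a rough fallback, and the start point [0.01 1700] is my own guess, so you may need to experiment with it:

% rough fallback sketch using base Matlab only (run after "load census" below)
sse = @(p) sum( (pop - p(1)*(cdate - p(2)).^2).^2 );  % sum of squared errors for f(x) = a*(x-b)^2
p_best = fminsearch(sse, [0.01 1700])                 % p_best(1) is roughly a, p_best(2) roughly b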
load census % loads variables 'cdate' and 'pop'
figure;
plot(cdate,pop,'o')
ylabel('Population');
xlabel('Census year');
hold on;

Suppose we wanted to fit this to a quadratic curve: f(x) = a*(x-b)^2. That is, we want to find the best values of a and b that minimize the squared error between the fit and the data. In non-linear fitting, we can't assume that anyone has solved for a and b analytically; usually, this is not the case. Instead, we want the computer to search for the best numbers by trying lots of combinations and finding the best fit. There are important limits to one's ability to do this, and we will focus on those limits next time.

We will call a series of functions to define our fit function and the search parameters. The first is fitoptions, which allows us to specify the method, lower and upper bounds for a and b, and where the algorithm should start searching (we'll specify a=1, b=1):

s = fitoptions('Method','NonlinearLeastSquares',... % use squared error
    'Lower',[0,0],...             % lower bounds for [a b]
    'Upper',[Inf,max(cdate)],...  % upper bounds for [a b]
    'Startpoint',[1 1]);          % search will start at a=1, b=1

Next, we'll build our custom fit function using the function fittype:

f = fittype('a*(x-b)^2','options',s);

Finally, we'll perform the fit with the function aptly named fit:

[c,gof] = fit(cdate,pop,f)

Look at all the beautiful values fit returns in c and gof. It gives 95% confidence intervals on the best values of a and b, and gives you an r² value to tell you how much of the total variance in the data is captured by the fit. Now we can plot the fit with the raw data:

x_values = 1750:2000;
y_values = c(x_values); % this returns c evaluated at x_values with the best a, b
hold on;
plot(x_values,y_values,'b--');

Q9: What is the r-squared value for the non-linear fit of the census data?
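As a final optional check that is not part of the original lab, you can recompute the r-squared value by hand from the fitted curve and confirm that it matches the rsquare field of gof (this sketch assumes c and gof from the fit above are still in your workspace):

resid = pop - c(cdate);              % residuals: data minus fit, evaluated at the census years
ss_res = sum(resid.^2);              % sum of squared residuals
ss_tot = sum((pop - mean(pop)).^2);  % total sum of squared deviations from the mean
r_squared_by_hand = 1 - ss_res/ss_tot % should match gof.rsquare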
Labs are copyright 2011-2012, Stephen D. Van Hooser, all rights reserved.