Lab 2.2 Using fitting as a tool, not like a fool
In the last lab, we saw that we could use correlation to examine if there were significant linear relationships between 2 variables, and we examined both linear and non-linear fits of data.
Now you might be thinking: "Hot dog! I can fit my data to any function I want just by using the fit function. I will be able to explain everything in terms of an equation."
Unfortunately, the process of fitting has several real limits, for sensible reasons. In this lab, we will learn how to get the most out of fitting without falling into some of its more common traps.
How fitting is accomplished under the hood: local optimization
Suppose we were mapping the response field of a nerve fiber in someone's arm. We want to know the extent of the area on the skin that gives a response, perhaps with a small probe. We might want to fit the data to a normal (gaussian) distribution, because this function has a nice central peak that falls off with distance. Here is the data, in the file armdata.txt.
Now let's fit the data to a gaussian. We saw a gaussian function in Problem Set 1.2, Q1.3.1-2. (If you are only doing the labs and not doing the homework, take a minute to do Q1.3.1 of Problem Set 1.2 to get introduced to the gaussian function.) To do this, we are going to pull out the list of fitting options. There are a number of functions that we can use with variables of the fittype data type. To get a list, you can type
We will use the function fitoptions to read the fit options. This list of fitting options is a data type that we haven't seen before, called a structure:
We will learn how to make our own structures down the road, but for now, all you need to know is that you can access the 'fields' of a structure using a period (dot). (Yes, strangely, the same symbol that you use to perform element-by-element arithmetic, but it does not have the same meaning when the variable is a structure.)
We'll edit the 'StartPoint' field to give an initial guess of our values a, b, c, and d:
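A minimal sketch of this setup follows. It assumes the gaussian has the form a + b*exp(-((x-c)^2)/(2*d^2)), with a the constant offset, b the peak height, c the position, and d the width, and it uses made-up stand-in data rather than armdata.txt. The starting values are arbitrary illustrations, and the order of the entries in StartPoint follows coeffnames(gauss).

```matlab
% Stand-in data (hypothetical): a gaussian bump centered at x = 6.
x = (0:0.5:10)';
y = 1 + 4*exp(-((x-6).^2)/(2*0.8^2));

% Assumed gaussian form: offset a, peak height b, position c, width d.
gauss = fittype('a+b*exp(-((x-c).^2)/(2*d^2))');

% Read the fit options for this fit type and edit one field with a dot.
fo = fitoptions(gauss);
fo.StartPoint = [0 1 0 1];   % initial guesses for a, b, c, d (arbitrary here)

% Perform the fit; the second output holds goodness-of-fit values.
[nerve, nervegof] = fit(x, y, gauss, fo);
disp(nervegof.sse);          % summed squared error between data and fit
```

With a poor starting guess for c, the search may settle far from the true peak, which is the behavior the next sections explore.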
Q1: What does the fit look like? The variable nervegof is the "goodness of fit" parameters; it is a data structure with multiple entries, including the summed squared error between the data and the fit (parameter sse). What is the summed squared error between the data and the fit in this case? Is the fit what you expected?
Let's design our fit so we can watch it happen over time. The fitting algorithm will start off with our initial guess, and vary each parameter up and down slightly to see which direction reduces the error between the fit and the data. Then, it will adjust its guess slightly in that direction, and keep moving, until the error is as small as it can get. So that we can see the process happen, we'll pause with the pause command (which pauses for the number of seconds requested). Here is a function to perform the fitting while we watch (watchfithappen.m, I recommend putting it in your stats folder):
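The search just described can be sketched in base Matlab as follows. This is not watchfithappen.m itself; it is a hand-rolled descent on made-up stand-in data, and the gaussian form, the nudge size h, the step length, and the iteration count are all assumptions chosen for illustration. A call to pause inside the loop, together with a plot, would let you watch each step.

```matlab
% Stand-in data (hypothetical): a gaussian bump centered at x = 6.
x = (0:0.5:10)';
y = 1 + 4*exp(-((x-6).^2)/(2*0.8^2));

% Model and summed squared error for parameters p = [a b c d].
model = @(p,x) p(1) + p(2)*exp(-((x-p(3)).^2)/(2*p(4)^2));
sse   = @(p) sum((y - model(p,x)).^2);

p    = [1 3 5 1];   % initial guess (arbitrary)
h    = 1e-4;        % size of the nudge applied to each parameter
step = 0.01;        % initial step length
err0 = sse(p);      % error at the initial guess

for iter = 1:2000
    % Nudge each parameter up and down to estimate the downhill direction.
    g = zeros(size(p));
    for k = 1:numel(p)
        dp = zeros(size(p));
        dp(k) = h;
        g(k) = (sse(p+dp) - sse(p-dp)) / (2*h);
    end
    % Move a little in that direction, keeping the move only if it helps.
    ptry = p - step*g;
    if sse(ptry) < sse(p)
        p = ptry;
        step = step*1.2;   % accepted: try a slightly longer step next time
    else
        step = step/2;     % rejected: shorten the step and try again
    end
end

fprintf('SSE went from %g to %g\n', err0, sse(p));
```

Because a move is kept only when it lowers the error, the search can only go downhill from wherever it starts, which is exactly why it can get trapped in a local minimum.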
Now let's try using it to fit our data:
Leave this window open for the moment.
Q2: Does the error reach an apparent minimum point during this parameter search procedure?
BUT, you protest, if it would just search for a value of c near 6, it would find a much better answer. Let's try it:
Q3: Is the squared error better or worse with the better starting position?
You might ask, why doesn't the computer just try every possible combination of values for these variables to identify the "global" minimum error, rather than just searching locally? The answer is that you don't have the time for it to try every combination of all variables. Suppose the computer could evaluate 1000 fits per second. Suppose there are 4 variables to your fit, and suppose you evaluate all 2^64 possible values of each one (for the default variable type, which is a double). Then the total time it would take is ((2^64)^4/1000) seconds, which is about 1.16x10^74 seconds, or about 1.34x10^69 days, or about 3.7x10^66 years, much, much, much longer than the age of the universe (about 13.7 x 10^9 years). So we can't quite do that.
Local error minima vs. the global minimum error
Rules of thumb for finding the global minimum error:
Use the simplest equation you can that describes your function; adding more parameters increases the likelihood that your fits will get stuck in a local error minimum
If possible, give the fitting function very good guesses as to the initial starting conditions that should be used
Repeat the fitting process from a variety of initial starting conditions; compare the results to see which starting position had the least error
If you know something about the limits of your variables (for example, if a variable in your equation cannot physically be less than 0, but mathematically it could be), make sure to include this information in your fit options.
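The third rule of thumb (repeat the fit from several starting positions and keep the one with the least error) can be sketched like this. The gaussian form, the stand-in data, and the grid of starting values for c are assumptions for illustration.

```matlab
% Stand-in data (hypothetical): a gaussian bump centered at x = 6.
x = (0:0.5:10)';
y = 1 + 4*exp(-((x-6).^2)/(2*0.8^2));

gauss = fittype('a+b*exp(-((x-c).^2)/(2*d^2))');
fo = fitoptions(gauss);

bestsse = Inf;
for c0 = 0:2:10                              % try several starting positions for c
    fo.StartPoint = [mean(y) max(y) c0 1];   % guesses for a, b, c, d
    [f, gof] = fit(x, y, gauss, fo);
    if gof.sse < bestsse                     % keep the fit with the least error
        bestsse = gof.sse;
        bestfit = f;
    end
end

disp(bestfit);
disp(bestsse);
```

Comparing the error across starts does not guarantee the global minimum, but it makes it much less likely that a single unlucky starting position determines the answer.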
Okay, let's try a few of these solutions for the problem at hand.
First, let's make a guess as to the starting values for our search. Let's base our guess on the data itself. Let's guess mean(y) for the constant offset parameter (a), and max(y) for the peak parameter (b). Let's find the location of the peak and make that our guess for the position parameter (c).
Second, although it is mathematically possible for a gaussian to appear upside-down, that is not expected in this situation, so let's limit the peak to being positive. Further, let's limit c to lie within the range of positions covered by our current data.
Let's look at how we do this. Examine the available fit options parameters:
should pop up something like this:
Fields of fit options structure (don't type this, this is what you should see):
We can set the upper and lower limits of the variables a, b, c, and d. Let's restrict a to be between -maxvalue and positive maxvalue (the constant offset doesn't need to go outside these bounds), and restrict b to be positive, and c to be within our bounds. We have less information here as to how to restrict d, so, for the moment, we won't.
Now let's perform the fit:
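A minimal sketch of a bounded fit of this kind, assuming the gaussian form a + b*exp(-((x-c)^2)/(2*d^2)) and made-up stand-in data. The names boundedfit and boundedgof are hypothetical, and the starting value of 1 for d is arbitrary. Lower and Upper list the limits in the same order as StartPoint.

```matlab
% Stand-in data (hypothetical): a gaussian bump centered at x = 6.
x = (0:0.5:10)';
y = 1 + 4*exp(-((x-6).^2)/(2*0.8^2));

gauss = fittype('a+b*exp(-((x-c).^2)/(2*d^2))');
fo = fitoptions(gauss);

% Starting guesses taken from the data itself.
[maxvalue, imax] = max(y);
fo.StartPoint = [mean(y) maxvalue x(imax) 1];   % a, b, c, d

% Limits: a within +/- maxvalue, b positive, c inside the sampled range,
% and d left unrestricted.
fo.Lower = [-maxvalue 0   min(x) -Inf];
fo.Upper = [ maxvalue Inf max(x)  Inf];

[boundedfit, boundedgof] = fit(x, y, gauss, fo);
disp(boundedfit);
disp(boundedgof.sse);
```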
Q4: Did our guesses work?
Garbage in, garbage out
As you've already seen, one of the most important rules pertaining to any analysis process is that when you put in data that doesn't conform to the assumptions of the process, then you get meaningless results out the other side. Here are a couple of common situations in fitting:
Situation 1a: the function doesn't describe the data very well
Let's consider the function we used to fit the population data in Lab 2.1: f(x) = a*(x-b)^2 . Suppose that instead of trying to use this function to fit population data (which inherently has an exponential shape), we tried to fit sinusoidal data.
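One way to set up this experiment is sketched below. The sinusoidal stand-in data, the starting values, and the names quadfit and quadgof are assumptions for illustration.

```matlab
% Sinusoidal stand-in data (hypothetical): two full cycles of a sine wave.
x = (0:0.1:4*pi)';
y = sin(x);

% The quadratic from Lab 2.1, deliberately mismatched to this data.
quad = fittype('a*(x-b)^2');
fo = fitoptions(quad);
fo.StartPoint = [1 0];              % guesses for a and b (arbitrary)

[quadfit, quadgof] = fit(x, y, quad, fo);

ci = confint(quadfit);              % confidence bounds: rows are lower/upper, columns are a, b
disp(ci);
disp(quadgof.rsquare);              % fraction of variance accounted for by the fit
```

The fit function will still return values for a and b here; the confidence intervals and rsquare are what tell you whether those values mean anything.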
Q5: Do you think that the fit function did the best job it could fitting the data to the quadratic function that was requested? What are the confidence intervals on the parameters a and b? Does the fit do a good job of accounting for the variance in the data? What is rsquared? Should you use this function for fitting this data?
Solution to this situation: pick a good function and verify it is a good choice on a lot of your example data.
Situation 1b: the function doesn't describe the data very well
There is another way in which the function might not describe the data well. Suppose we are trying to sample a relatively unresponsive nerve in someone's arm, and we obtain the data in morearmdata.txt below.
Q6: Did you still get a value? Did it fit anything? Would you report that the nerve preferred position c and had a width parameter of d? Does that seem like the right thing to do?
Solution to this situation: Consider some sort of statistical test to reject noisy cells from being fit in the first place; an easy test is that the maximum value might need to be at least 5 standard deviations above the noise.
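One possible version of such a test is sketched below. It assumes you have a separate set of measurements taken with no stimulation (the hypothetical variable baseline) from which to estimate the noise; all of the numbers are made up, and other ways of estimating the noise are equally reasonable.

```matlab
% Hypothetical responses recorded with no stimulation, used to estimate the noise.
baseline = [0.1 -0.2 0.05 0.3 -0.1 0.0 -0.25 0.15];

% Hypothetical responses recorded while probing positions on the arm.
y = [0.2 -0.1 0.4 0.1 0.3 -0.2 0.5 0.0];

% Require the largest response to exceed the noise mean by 5 standard deviations.
threshold = mean(baseline) + 5*std(baseline);
isresponsive = max(y) > threshold;

if isresponsive
    disp('Largest response clears the threshold; fitting is reasonable.');
else
    disp('Largest response is within the noise; do not fit this cell.');
end
```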
Is it 5 standard deviations above the noise?
Situation 2: the function might describe the data in principle, but you didn't sample enough data points to allow a good fit
Suppose we had sampled the nerve that provided the data for armdata at low resolution, so that we only sampled every 2mm:
What happens if we perform the fit? (You can cut and paste here, this is a lot of the same concepts):
Q7: Compare the confidence intervals for parameters c and d for nerve3 and nerve5. Did the lower resolution fit have the right answer?
Number of parameters and model selection
One big question that one often wants to answer is the following: I have 2 functions that fit the data well. Which should I use? Can I make a statistical argument that one is better than the other?
Let's look at a specific example. Let's quickly re-do the fit of the census data from last time:
Q8: How does the rsquared of the third order polynomial fit compare to the second order polynomial?
In general, more parameters means a lower error on the data being fit. If we want to compare 2 fits that have the same number of parameters, we can just compare the squared errors (again, assuming both fits are not garbage by the criteria above).
So, suppose we have 2 fits that DON'T have the same number of parameters? How do we say which one is "better", discounting the extra number of free parameters?
In general, this is a pretty difficult question. There are a variety of approaches (see MacKay, "Information-based objective functions for active data selection", Neural Computation, 4:589-603).
One of the simplest is the "Nested F-test", which compares 2 "nested fits". The 2 fits above are considered "nested", because the 2nd is just an elaboration of the first with more parameters; we could get exactly the first fit if we set c and d equal to 0 above.
The formula for the Nested F test is the following: compute the value F, and then we'll examine whether the value of F is what we expect for the "same" fit quality, or if the reduced error of the second model is more than we expect for adding 2 additional parameters.
To do this, we need to examine the change in error but also the change in a quantity known as "degrees of freedom". Degrees of freedom, loosely speaking, is the number of data points we have minus the number of parameters; it's the number of dimensions that remain "free" to vary after the parameters are specified.
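For reference, the standard textbook form of the nested F statistic is written out here. The symbols are assumptions introduced for this note: N is the number of data points, k_1 and k_2 are the numbers of parameters in the simpler and the more elaborate fit, and SSE_1 and SSE_2 are their summed squared errors.

```latex
\[
  df_1 = N - k_1, \qquad df_2 = N - k_2,
\]
\[
  F = \frac{(SSE_1 - SSE_2)\,/\,(df_1 - df_2)}{SSE_2\,/\,df_2}
\]
```

The numerator is the drop in error per added parameter, and the denominator is the error per remaining degree of freedom in the more elaborate fit. A large F means the extra parameters reduced the error by more than would be expected from the added flexibility alone.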
We can obtain a P value for the nested F test by evaluating the cumulative density function for F, which is provided by the Matlab function fcdf (see help fcdf):
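A minimal sketch of this calculation, using the standard form of the nested F statistic. The number of data points and the two error values are made up for illustration; in practice they would come from the goodness-of-fit outputs of the two fits.

```matlab
% Hypothetical values: N data points, a 2-parameter fit and a 4-parameter fit.
N    = 21;
sse1 = 12.0;  k1 = 2;    % simpler fit: summed squared error, number of parameters
sse2 = 8.0;   k2 = 4;    % more elaborate fit

df1 = N - k1;            % degrees of freedom of the simpler fit
df2 = N - k2;            % degrees of freedom of the more elaborate fit

% Drop in error per added parameter, relative to the error per remaining
% degree of freedom in the more elaborate fit.
F = ((sse1 - sse2)/(df1 - df2)) / (sse2/df2);

% Probability of an F at least this large if the extra parameters did not help.
p = 1 - fcdf(F, df1 - df2, df2);

fprintf('F = %g, p = %g\n', F, p);
```

A small p suggests that the reduction in error is larger than would be expected from simply adding 2 parameters.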
Q9: Is the 4 parameter fit significantly better than the 2 parameter fit?
There is a nice description of the nested F-test here.
Model selection is one of the more difficult tasks in fitting. It requires a lot of care, probably 3 times the care of fitting a single function! In the nested approach, if either of the fits "fails" due to the reasons we discussed above (getting stuck in local error minima, not enough points, the fit doesn't describe the data at all), then the output of this "Nested F test" is bogus. But it is useful when you need it and have spent the time establishing that your fits are of high quality.
Number of parameters and overfitting
In labs 1.5-1.6, we saw that we can never know the "true" distribution of our data, but we can learn some of its parameters by sampling. The process of sampling means that our data will deviate from the true distribution slightly, but we can infer certain properties of our distribution (like the mean, the amount of correlation between 2 variables, etc).
When we are fitting, we want to be careful to avoid loading up on parameters so that the fit runs through all of our data points, because this probably means we are fitting the particulars of the sample, rather than features of the "true" distribution. For example, consider the following fit (from Wikipedia::Overfitting):
The line approximately fits the data; some ambitious person has gone in and fit the data using a higher order polynomial (y = a + b*x + c*x^2 + d * x^3 + ... ). Notice that the higher order polynomial fit, which has many parameters, goes through every single data point and has 0 apparent error! It must be better, right? Well, what this person has probably done is to fit the sample, while actually doing a worse job of fitting the "true" distribution that we don't know. If we were to perform sampling a second time, the higher order polynomial would probably describe the new sample much worse than the simple linear fit would. So when it comes to number of parameters, remember, the fewer the better, generally speaking.
The tutorial is already pretty long, so we won't work any examples for overfitting. If you're interested, you can check out this video blog on the topic. It is short and has some pretty nice slides.
Matlab functions and operators
Labs are copyright 2011-2012, Stephen D. Van Hooser, all rights reserved.