Lab 3.1 Time series: correlation, feature detection, rates
Introduction
In the sciences, we often make observations of phenomena over time. Any record of a quantity measured over time is called a Time Series. We have seen a few time series already in this class; for example, the census data from the United States that we fit in Lab 2.1 is an example of a time series. Time series do not have to be a series of single numbers; images taken over time can also be considered a type of time series.
With time series analysis, one can address a number of important questions. First, time series are a record that can be mined for relevant information. Second, we can examine temporal correlations (that is, correlations across time) between different quantities or even across the same quantity at different times. Third, we can use rapid sampling to identify important events, and calculate the time of occurrence and rates of those events. We will do all of these things in this lab.
Each time series has a minimum of 3 series of data: time, value, and index number
For this exercise, let's load in some example data. Over the first half of the lab, we'll examine temperature data taken over more than 100 years at various weather stations around the US and the world: at Boston Logan Airport, at the Blue Hills Reservation near Boston, at Death Valley in California, and at McMurdo research station in Antarctica. These data are from the UK Met Office.
Let's load in the data file temperatures.mat (available for download below) and look at what our new variables are using the who command (if you don't know it, type help who):
load temperatures
who
Now let's plot the data from the Blue Hills Reservation:
figure(100);
plot(BlueHills(:,1),BlueHills(:,2),'b-');
xlabel('Time (years AD)');
ylabel('Temperature (degrees C)')
legend('Blue Hills');
Leave this figure open when you're done with it! We'll go back to it later...
Now take a minute to zoom around the data. There are lots of interesting things to notice. First, there are some data missing from the middle of the 2000's; I don't know why they're not there, but it's important to realize that time series are not required to be continuous; they can have breaks. Second, you probably notice that the data are highly periodic, as you'd expect. The temperatures go up and down over the years, reaching a local maximum in July/August and a local minimum in January/February. Let's examine the 3 series of this time series. So we aren't overwhelmed with data, let's just look at the first 10 points:
BlueHills(1:10,:)
You can see there are 2 columns of data. The first column is the time in years AD, and the second column is the temperature in degrees C. You probably expected this. There is also a third, implied (maybe obvious) column that corresponds to the sample index number. Let's make it explicit by writing it out:
[BlueHills(1:10,:) [1:10]' ]
If you haven't seen it before, the single apostrophe indicates the operation transpose, and converts the row vector 1:10 into a column vector. So there are 3 values for each data point:
the time it was recorded,
the value itself,
and the sample index number (often implied and not coded explicitly)
Suppose we wanted to calculate the average temperature in July for all of these years. We could do a search, using the find command, to find all data points that were measured on the half years (that is N.5). The function fix returns the integer part of a number. For example:
fix(1.5)
fix(10)
fix(10.3)
We could calculate the integer part (using fix) of all of the years in our data set, subtract it away, and then we'd be left with the fractional part of each year. Let's do this on a small set of data.
BlueHills(1:10,1)
fix(BlueHills(1:10,1))
BlueHills(1:10,1) - fix(BlueHills(1:10,1))
Now we can find all data points whose remainder is equal to 0.5 (or within 0.0001 of it, in case there was any rounding error when I imported the data from the UK Met Office):
remainders = BlueHills(:,1)-fix(BlueHills(:,1));
july_indexes = find( abs(remainders-0.5) < 0.0001);
Let's examine the first few index values:
july_indexes(1:10) % the index values
BlueHills(july_indexes(1:10),2) % the temperatures at those index values
To check to make sure we've found the correct index values, let's plot these data points as circles on our original graph (hopefully it is still open):
figure(100);
hold on
plot(BlueHills(july_indexes,1),BlueHills(july_indexes,2),'bo');
Q1: Zoom in on the graph. Are the index values in the correct places?
Now we can look at the distribution of July temperatures:
july_temps = BlueHills(july_indexes,2);
figure(101);
[xbh,ybh]=dasw.plot.cumhist(july_temps);
plot(xbh,ybh,'b-');
axis([10 40 0 100]);
xlabel('Mean temperature in July (C)');
ylabel('Fraction of years');
Q2: What's the median temperature in July over this data range at the Blue Hills observation point?
Correlation over time
In Lab 2.1, we considered the correlation between 2 quantities as quantified by the correlation coefficient r. We calculated the correlation between 2 variables x and y as follows:

r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}
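As a refresher, here is a minimal sketch of that calculation done by hand and checked against MATLAB's built-in corrcoef function (the data here are made up purely for illustration):
x = randn(100,1); % illustrative example data
y = 0.5*x + 0.5*randn(100,1); % y is partially correlated with x
r_manual = sum((x-mean(x)).*(y-mean(y))) / ...
    sqrt(sum((x-mean(x)).^2) * sum((y-mean(y)).^2))
Rfull = corrcoef(x,y);
Rfull(1,2) % should match r_manual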
We can also ask what knowing the value of our measured quantity at a time t tells us about that quantity at time t + tau. For example, what does knowing the temperature at a particular time tell us about the temperature in the next month? Let's say
x corresponds to temperature at a time t
y corresponds to temperature at time t + 1 month
Then we can pull out these data sets from our time series.
All of our available times t are as follows:
all_t = BlueHills(:,1);
Let's look at the first time point and see if we can find a corresponding point at t+1 month. But wait, you say..."the next month is just going to be located in the very next sample". Sure, this is true for a majority of the data in our dataset...but there are some gaps in the data. That is, there are some months for which we have no data. If we just assumed that adjacent samples always corresponded to recordings that were 1 month apart, then we would make several errors. So we have to search using something like the find command.
first_t = all_t(1)
inds = find(all_t == first_t + 1/12)
You might be surprised to see that this doesn't turn up anything. This is due to rounding, a problem we have to deal with when we work with experimental data. Let's look at the absolute value (using abs) of the differences between all times and the first time plus 1 month:
figure;
plot(abs(all_t-(first_t+1/12)),'-o');
xlabel('Sample index number');
ylabel('Difference from the first time point + 1/12')
Q3: Which sample number is closest to first_t + 1/12? Is the value that you plotted 0 or just very small? If it's very small, is it on the order of 10 to the minus what (10^(-?))?
So we need to modify our search algorithm a little bit.
inds = find( abs(all_t-(first_t+1/12))< 0.0001)
Did it come up with the index value you expected?
So our first data point in x would be BlueHills(1,2) and the corresponding value in y would be BlueHills(inds,2):
BlueHills(1,2)
BlueHills(inds,2)
Now we can use a 'for loop' to create our samples x and y:
We want to find all available times for which we also have a corresponding measurement at time t + 1 month:
x = []; % start with an empty matrix
y = [];
for i=1:length(all_t),
target_time = all_t(i) + 1/12;
inds = find( abs(all_t-(target_time))< 0.0001);
if length(inds)>0, % if we have a match (because we might not)
x(end+1) = BlueHills(i,2);
y(end+1) = BlueHills(inds,2);
end;
end;
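Before we compute the correlation coefficient, it can help to see the relationship directly. Here is a quick (optional) scatter plot of each month's temperature against the next month's:
figure;
plot(x,y,'.');
xlabel('Temperature in month t (C)');
ylabel('Temperature in month t + 1 month (C)');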
Now we can examine how much knowing the temperature at one month tells you about the temperature in the next month:
R = corrcoef(x,y)
Recall that R(x,y) is in the upper right or lower left of this matrix:
R(1,2)
Q4: What is R(x,y)?
Correlograms and Autocorrelograms
One way to examine the correlation within a time series is to plot a correlogram. A correlogram examines the correlation coefficient at several time lags. Above, we calculated the correlation between the monthly temperature at the Blue Hills Reservation and the next month, but we could just as easily look at other time lags. The function correlogram.m extends the code above to loop over an array of user-specified time lags (Put it at [your UNET home]/MATLAB/+dasw/+stats/correlogram.m):
correlogram.m
function [R, SIG] = correlogram(t1, data1, t2, data2, lags, tolerance, alpha)
% CORRELOGRAM - Compute correlations at different time offsets
% [R,SIG] = dasw.stats.correlogram(T1,DATA1,T2,DATA2,LAGS,TOL,ALPHA)
%
% Returns in R the values of the correlation at different time
% lags specified in LAGS for the 2 time series with time values
% T1 and T2 and data values DATA1 and DATA2. 2 time values are
% considered to be the same if they are within TOL of one another.
%
% SIG is the value that R would need to exceed to be significant
% at level ALPHA at each lag location.
%
% Example:
% t = 0:0.01:10;
% y1 = sin(2*pi*t);
% y2 = cos(2*pi*t);
% R=dasw.stats.correlogram(t,y1,t,y2,[-1:0.01:1],0.001,0.05);
%
for j=1:length(lags),
x = []; % start with an empty matrix
y = [];
for i=1:length(t1),
target_time = t1(i) + lags(j);
inds = find( abs(t2-(target_time))< tolerance);
if length(inds)>0, % if we have a match (because we might not)
if length(inds)>1, % if we have more than one match, pick the best
[minvalue,inds] = min(abs(t2-(target_time)));
end;
x(end+1) = data1(i);
y(end+1) = data2(inds);
end;
end;
if length(x)>2,
Rfull = corrcoef(x,y);
R(j) = Rfull(1,2);
% now significance
T2 = (tinv(alpha, length(x)-2))^2;
df = length(x) - 2;
SIG(j) = sqrt((T2/df)/(1+T2/df));
else, % no data, return NaN for this lag position
R(j) = NaN;
SIG(j) = NaN;
end;
end;
lags = [-1:1/12:1]; % lags from -1 year to +1 year in 1 month steps
tolerance = 0.0001;
alpha = 0.05;
[R,S]=dasw.stats.correlogram(BlueHills(:,1),BlueHills(:,2),BlueHills(:,1),BlueHills(:,2),...
lags,tolerance,alpha);
figure;
plot(lags,R);
hold on;
plot(lags,-S,'--');
plot(lags,S,'--');
xlabel('Time lag (years)');
ylabel('Correlation coefficient');
title('Correlogram: Blue Hills Reservation');
Here we have plotted the correlation coefficients along with a dashed line that indicates the "line of significance" at alpha = 0.05. Any point that exceeds the line of significance has a likelihood of less than alpha of being due to sampling alone.
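To see where that dashed line comes from, here is the function's significance formula worked out by hand for a hypothetical sample of 100 matched pairs (in the real correlogram, the number of pairs can vary from lag to lag):
n = 100; % hypothetical number of matched pairs
df = n - 2;
T2 = (tinv(0.05, df))^2; % squared critical t value at alpha = 0.05
r_crit = sqrt((T2/df)/(1+T2/df)) % the "line of significance" for this n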
Q5: If you know the mean temperature of a given month, is the temperature 6 months later likely to be similar or different? That is, is r positive or negative? Is it significant at level alpha?
When we examine the temporal correlation of a variable with itself, then the correlogram is called the autocorrelogram. But we can also examine the correlation between one variable and another.
Let's look at the correlation between temperatures in Boston and those in Antarctica. First, let's add the McMurdo station data to our plot.
figure(100);
hold on;
plot(McMurdo(:,1),McMurdo(:,2),'r-');
legend('Blue Hills','Blue Hills July','McMurdo');
Brrr...there are some hardy souls down there. Now let's look at the correlogram between Blue Hills and McMurdo:
[R2,S2]=dasw.stats.correlogram(BlueHills(:,1),BlueHills(:,2),McMurdo(:,1),McMurdo(:,2),...
lags,tolerance,alpha);
figure;
plot(lags,R2);
hold on;
plot(lags,-S2,'--');
plot(lags,S2,'--');
xlabel('Time lag (years)');
ylabel('Correlation coefficient');
title('Correlogram: Blue Hills * McMurdo');
Q6: Is McMurdo correlated or anti-correlated with Blue Hills? How many months "ahead" is McMurdo? Does the fact that the 2 locations have different mean temperatures matter for the correlogram?
For fun, more data:
figure(100);
plot(DeathValley(:,1),DeathValley(:,2),'y-');
legend('Blue Hills','Blue Hills July','McMurdo','Death Valley');
[R3,S3]=dasw.stats.correlogram(BlueHills(:,1),BlueHills(:,2),...
DeathValley(:,1),DeathValley(:,2),lags,tolerance,alpha);
figure;
plot(lags,R3);
hold on;
plot(lags,-S3,'--');
plot(lags,S3,'--');
xlabel('Time lag (years)');
ylabel('Correlation coefficient');
title('Correlogram: Blue Hills * Death Valley');
Time series with regular sampling
Often in science we will record data on the computer using a perfectly regular sampling interval (measured in seconds) or sampling rate (measured in samples/second, or hertz, Hz).
Download the file normalecg.txt (provided by Milwaukee School of Engineering, originally from MIT-BIH database). This is a recording of an electrocardiogram, a recording of the electrical activity of the heart from a skin electrode:
ecg = load('normalecg.txt');
size(ecg)
You'll notice that these data have only 1 column; there is no associated time variable.
figure;
plot(ecg)
xlabel('Sample number');
ylabel('Voltage (mV)');
The sampling rate for this recording was 128 Hz. This means the sampling interval is 1/128 or 0.0078125 seconds.
Since the ecg is a time series, we still have 3 series of data: the sample times, the sample values, and the sample index numbers, but with regularly sampled data we often have to create the time portion of the data ourselves. This is not hard if we know the sampling rate (or, equivalently, the sampling interval):
SR = 128;
t = 0:1/SR:(length(ecg)-1)/SR;
This creates a time variable that starts at 0 and increases by 1 / SR until it gets to the right number of samples to match our data variable. Let's plot it again:
figure(102);
plot(t,ecg)
xlabel('Time (s)');
ylabel('Voltage (mV)');
Leave this figure open, we'll add to it in a minute.
Q7: Calculate the sampling interval for 60Hz, 120Hz, 10kHz
Identifying features and calculating rates
Sometimes when we study a time series we would like to pull out specific features, and find the time of their occurrence and perhaps their rate of occurrence. This is especially true for signals like the ECG or signals like neural recordings.
It is actually moderately difficult to extract pulse times from ECG signals or from neurons because in real life one has to be sure to reject electrical artifacts from touching the leads or other sources. Here in the class, we won't do this in a "professional" way, but we'll use a quick and dirty method that will work for illustrating the concept.
In the ECG data, all of the "spikes" exceed 2mV. We can identify the presence of a spike by looking for transitions between values that are less than 2mV to values that are greater than 2mV.
Let's unpack how we will do this. Suppose we have a signal and we want to find the locations where the signal transitions between values less than a threshold to values that are greater than a threshold. Let's look at a very short signal for illustration:
A = [ 0.4 0.2 0.3 1.2 1.5 1.3 1.4 ]
Let's suppose we want to write code to pull out the locations of all of the transitions from values less than 1.0 to values greater than or equal to 1.0. Clearly, just by looking at the signal, we can see that the transition occurs between samples 3 and 4, such that 4 is the first sample above threshold. We can pull out all such locations by looking for these transitions:
A(1:end-1)<1 & A(2:end)>=1
Let's look at how this works. Let's look at the vectors individually:
A(1:end-1) % the data
A(1:end-1)<1 % are the points less than 1?
A(2:end) % the data
A(2:end)>=1 % are the points greater than or equal to 1?
So, for a signal of length N, we can examine the N-1 transitions between adjacent samples to see whether each one is a threshold crossing. We can return all such samples with
1+find(A(1:end-1)<1 & A(2:end)>=1)
Let's write a function to do this (you should put this in a new +signal package at [your UNET home]/MATLAB/+dasw/+signal/threshold_crossings.m):
threshold_crossings.m
function [ index_values ] = threshold_crossings( input, threshold )
%THRESHOLD_CROSSINGS Detect threshold crossings in data
%
% INDEX_VALUES = dasw.signal.threshold_crossings(INPUT, THRESHOLD)
%
% Finds all places where the data INPUT crosses the threshold
% THRESHOLD. The index values where this occurs are returned in
% INDEX_VALUES.
%
index_values = 1+find( input(1:end-1)<threshold & ...
input(2:end) >= threshold);
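Before applying it to the ECG, we can sanity-check the function on the short signal A from above; it should report sample 4, the first sample above threshold:
A = [ 0.4 0.2 0.3 1.2 1.5 1.3 1.4 ];
dasw.signal.threshold_crossings(A, 1) % should return 4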
And let's use it to find each heart beat:
threshold = 2;
spike_samples = dasw.signal.threshold_crossings(ecg, threshold);
Let's examine the times that we've identified:
figure(102);
hold on;
plot(t(spike_samples),ecg(spike_samples),'bo');
Now zoom in on these little dots. Do they correspond properly to the spike onset time? (Note that here we're talking about the spike onset time, not necessarily the peak location of the spikes.)
We can estimate the average rate by dividing the number of spikes (or beats) that we've identified by the duration of the recording. Heart rate is usually expressed in beats per minute, so we'll multiply by 60 to convert from beats per second to beats per minute.
avg_heart_rate = 60 * length(spike_samples)/(t(end)-t(1))
Just for fun, let's use the text (see help text) command to plot this rate on our graph:
text(40,3,['Rate = ' num2str(avg_heart_rate,2)]);
We can also plot the instantaneous rate by plotting the beat-by-beat rates. We can do this in the following manner:
spike_times = t(spike_samples);
spike_diffs = spike_times(2:end) - spike_times(1:end-1); % beat-by-beat time diffs
spike_instantaneous_rates = 1./spike_diffs;
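As an aside, MATLAB's built-in diff function computes the same element-by-element differences in a single step:
spike_diffs = diff(spike_times); % identical to spike_times(2:end) - spike_times(1:end-1)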
The instantaneous rates involve the difference in time between each spike and the one before it, so we don't know the instantaneous rate of the first spike (we didn't record its previous spike). So we have to leave it out of the graph:
figure;
plot(spike_times(2:end),60*spike_instantaneous_rates);
ylabel('Instantaneous heart rate (beats/min)');
xlabel('Time(s)');
Q8: What is the standard deviation (remember std ?) of the instantaneous rates? Is the heart rate relatively constant across the recording?
Next time we'll dive further into sampling and filtering.
Function reference
Matlab functions and operators
load, who, plot, figure, hold, axis, xlabel, ylabel, legend, title, text, find, fix, abs, min, length, size, corrcoef, tinv, num2str, diff, std, transpose (')
User functions
dasw.stats.correlogram.m
dasw.signal.threshold_crossings.m
Labs are copyright 2011-2021, Stephen D. Van Hooser, all rights reserved.