# Lab 1.5 Did my drug really work? Statistical inference

## Introduction:

### Results of PS1.1, Q4.1:

In PS1, Q4.1, you were asked to guess whether 2 samples that you examined were truly different, or if they were just 2 random samples of the same underlying distribution. The responses varied...most people thought about half of trials were truly different. Because this is an analysis class, we will present the results in histogram form:

## Philosophy of sampling / statistical inference:

So, if we were to perform the same experiment twice, we always get different numbers due to the fact that we are **sampling**. The question of did my drug work is therefore equivalent to:

"Did I merely sample the same distribution twice, or is there some difference to these distributions?"

In our example: "Did I merely sample the number of years of survival of 2 groups of people that are representatives (samples) of the same population, or has the drug changed the underlying distribution of the survival times of those who took it such that the 2 samples are drawn from 2 different distributions?"

Experimental scientists who are trying to understand the world use the following model (either implicitly or explicitly):

The **true distribution** (my term) is the quantity that we would ideally like to know in order to answer our scientific question. However, we cannot know it exactly because we are physically limited to taking measurements (samples) in order to learn about the world. Sometimes we cannot know this true distribution because it would be impractical to measure every example. The distribution of "survival times following a heart attack" is such an example; it is impractical (and statistically unnecessary) to follow every heart attack victim from the time of the attack until their death, which could be years later. Or, it may not be possible to measure the true distribution due to experimental noise. But often the true distribution is an abstract thought; we might want to know "the distribution of survival times following a heart attack without administering a drug of interest" and "the distribution of survival times following a heart attack with the administration of the drug of interest". These distributions are abstract thoughts, because we cannot both administer and not administer the drug to all individuals; instead, we want to know what these distributions *would* look like if we were to perform one action or the other.

Fortunately, we can learn most of what we want to know about the true distribution using sampling and **inference**. Most theories of statistics assume that samples are **simple random samples**; that is, *each data point that comprises the true distribution has an equal chance of being part of the sample*.

Inference

In this figure, left panel shows the "true" distribution of some random process we might want to measure. The center shows a cumulative histogram for 30 samples of this process. We can never know the "true" distribution exactly by only sampling, but by using inference we can estimate a what that distribution might be within a certain range (gray area at right). In this lab we will focus on examining whether 2 distributions are different.

## Our first example of statistical inference: Did my drug do *anything*?

*anything*?

In the first few labs, we have looked at examples of pairs of samples from trials involving a "drug" and a "placebo". For any given experiment, are these 2 distributions different? Of course, the values in the 2 samples are different, but do we expect that the 2 **true distributions** are different?

Statisticians address this by studying a related question:

Imagine we sampled the same true distribution twice; what would the differences in the samples look like?

Let's do this and examine the answer. We will draw 2 samples from the same distribution using our function generate_random_data. We'll plot the data and look at the maximum difference in the cumulative histograms (in absolute value). (We are examining this particular quantity because some mathematicians found something interesting about it, as we'll see soon.)

First, let's make sure your path is set up properly by checking if Matlab knows where to find generate_random_data:

which generate_random_data

If you still have it on your path, you should see where it is located. Otherwise, make sure you have added it to your path using addpath as in Lab 1.4.

Now let's create 2 normally-distributed samples of 20 data points with mean 0 and standard deviation 1:

sample1 = generate_random_data(20,'normal',0,1);

sample2 = generate_random_data(20,'normal',0,1);

[X1,Y1] = cumhist(sample1,[-10 10],0.1);

[X2,Y2] = cumhist(sample2,[-10 10],0.1);

figure(90);

plot(X1,Y1,'bo-');

hold on;

plot(X2,Y2,'ro-');

xlabel('Value');

ylabel('Percent of samples');

We want to calculate the maximum difference between these 2 cumulative histograms. To do this, we will use the function we built last time, cumulative_hist_diff.

To remind yourself of what the function does, take a look at the help information:

help cumulative_hist_diff

Now let's plot the difference:

[maxdiff,maxdiff_loc,Xval,sample1CDF,sample2CDF] = cumulative_hist_diff(sample1,sample2);

figure(90);

Xvalues = [ Xval(maxdiff_loc) Xval(maxdiff_loc) ];

Yvalues = 100 * [ sample1CDF(maxdiff_loc) sample2CDF(maxdiff_loc) ];

plot(Xvalues,Yvalues,'k-','linewidth',2);

maxdiff,

Suppose we were to repeat this process of drawing 2 samples from the same true distribution over and over again. What would the (sampled) distribution of maximum differences between sample1 and sample2 look like?

## Repeating the same operation over and over again with while loops

The need to repeat something over and over again arises often in computing. Loops allow us to repeat tasks until we are finished. One type of loop is the **while** loop, which is written as follows:

while (*true or false expression is true*)

*statement1*;

*statement2*;

...

end;

A true or false expression in Matlab is an expression that either returns 0 or 1; 0 corresponds to *false*, and 1 corresponds to *true*. For example, we can use the **relational operations***less than* (<), *equals* (==), or *greater than* (<) to ask whether a new variable a is less than, equal to, or greater than 5:

a = 4;

a<5,

a==5,

a>5,

Note that the relational operator *equals* is written with 2 equals signs (==); this is to distinguish from the operation of *assignment*, which is denoted by 1 equals sign(=). a=5 means "assign to a the value 5", whereas a==5 means "test if the variable a is equal to 5". You can see more relational operators in Matlab by clicking here.

Let's create a simple while loop that increments a variable until it reaches a certain value.

i = 1;

while (i<10)

i,

i = i + 1;

end

This prints the value of i as the loop runs. The loop stops when i is equal to 10 (and the expression i<10 becomes *false*).

Let's use a while loop to make many samples of the difference between our 2 sampled distributions above:

md = [];

i = 1;

while (i<1001),

sample1 = generate_random_data(20,'normal',0,1);

sample2 = generate_random_data(20,'normal',0,1);

maxdiff = cumulative_hist_diff(sample1,sample2);

md(i) = maxdiff;

i = i + 1;

end;

At the end of this loop, we will have filled md with 1000 samples. Check it and see:

size(md),

Now let's examine the cumulative histogram of the samples md:

[Xmd,Ymd] = cumhist(md,[0 1],0.01);

figure(100);

plot(Xmd,Ymd,'k-');

xlabel('Max diff b/w 2 samples of size 20');

ylabel('Percent of data');

**Q1**: What is the median difference? What is the difference value encountered that is equal to the 95%-tile? The 99%-tile? You may have to use the "zoom" button on the figure to zoom in to read off some of these values.

## Examining the difference between 2 samples that really do arise from 2 distributions

Now let's generate 2 samples that are derived from 2 different distributions. We will generate 1 data set with mean 0 and a second with mean 2.

sample1 = generate_random_data(20,'normal',0,1);

sample2 = generate_random_data(20,'normal',2,1);

[X1,Y1] = cumhist(sample1, [-10 10], 0.1);

[X2,Y2] = cumhist(sample2, [-10 10], 0.1);

maxdiff = cumulative_hist_diff(sample1,sample2);

Now let's plot cumulative histograms of these 2 samples on our figure, and also plot the maximum difference between the cumulative histograms on our previous graph of differences where the samples were drawn from the same underlying distribution:

figure(101);

plot(X1,Y1,'r-');

hold on

plot(X2,Y2,'b-');

xlabel('values of 2 samples of size 20');

ylabel('Percent of data');

figure(100);

hold on;

plot([maxdiff maxdiff],[0 100],'g-'); % we'll plot a green vertical bar

(When you're finished, leave the figure open; we're going to work with it a little more later.)

**Q2**: Now examine where the maximum difference between these 2 samples lies relative to the distribution of 1000 maximum differences that we generated above. That is, at what percentile (Y axis location on the graph of cumulative maximum differences md) is the vertical bar? (In other words, where on the Y axis does the black line intersect the green vertical bar?) Compute (100% - *this percentile)* to obtain the answer to our question:

*what is the likelihood that these 2 samples of 20 are samples of the same underlying distribution*?

## We can often calculate the likelihood that 2 or more sets of samples came from the same underlying distribution

Mathematicians can calculate, for various assumptions, the likelihood of statements like the one above: *are these 2 samples generated from the same underlying distribution*? In 1933, the mathematician Andrey Kolmogorov published a proof of an equation that estimates the probability that 2 samples were generated by the same distribution. In 1948, N.V. Smirnov published tables of solutions Kolmogorov's equation. Since computing was harder in those days, the "test" that resulted is named after both mathematicians: the **Kolmogorov-Smirnov test**. Here, let's not worry too much about the formula of the equation, but let's focus a lot on the *assumptions* and the *expected cumulative histogram* (read slowly, probably many times):

### Kolmogorov-Smirnov Assumptions

Suppose sample1 and sample2 are empirical samples of the same underling distribution, and n1 is the number of samples in sample1, and n2 is the number of samples in sample2, and D is the difference between the cumulative histograms of sample1 and sample2. Then, the *expected cumulative histogram* of D, which is expressed as **P(x≤D)** (read "the probability that a value x on the x axis is less than the value of D) is as follows:

This is some math that we don't need to follow in detail; the thing to know is that the answer we get from it is the expected cumulative histogram **P(x≤D)**. It's not too difficult to calculate the cumulative histogram P(x≤D), but I don't want to distract us from the main lesson at hand, so I've included a function called **ks2_cdf** that computes the cumulative histogram (more commonly known by the term **cumulative density function**, or **cdf**). Download this function by clicking the link above, and put it here:

*[your UNET home]*/MATLAB/tools/stats/ks2_cdf.m

Now let's try it out:

help ks2_cdf

x = 0:0.01:1; % create the X points

n1 = 20;

n2 = 20;

math_cdf = ks2_cdf(n1, n2, x);

math_cdf_percent = 100*math_cdf; % convert to percent rather than fraction

figure(100);

hold on;

plot(x,math_cdf_percent,'k--'); % black dashed line

**Q3**: Does the mathematically-computed curve have the same shape as the one you created in your repeated experiments with the same underlying distribution?

**Q4**: Find the location on the Y axis (that is, the percentage) where the mathematically-computed curve overlaps the maximum difference value for the 2 samples that truly are different (the green vertical bar in figure 100). Calculate *100% - this probability* to get the mathematically-computed likelihood that these 2 samples are derived from the same distribution.

## Hypothesis testing

We've just performed a major task in statistical inference: we calculated the probability that 2 samples were generated by the same distribution. This probability was very low when we were dealing with 2 distributions that were actually different (because we constructed them that way).

But you don't want to tell the newspapers "the probability that my drug did not have an effect is 0.003"; the readers would rather just read whether or not the drug had a * significant* effect. This type of statement requires a statistical test of a hypothesis.

The standard method of performing such a hypothesis test is a little backward, but it is the following:

Suppose the drug had no effect; this is called the

**null hypothesis**because it is a negative hypothesisCalculate the probability that the null hypothesis is true (this is just the probability that the 2 samples were drawn from the same distribution that we calculated both empirically and mathematically above)

If this probability is less than a cut-off factor

**alpha**(the most common choice is 0.05), then we say that**we reject the null hypothesis with a confidence alpha**; that is, the results are significantIf this probability is not less than the cut-off factor alpha, then we cannot reject the null hypothesis; that is, we don't have strong evidence that the drug did anything.

You might ask:

*Isn't this cut-off factor alpha rather arbitrary?*Yes, it is.*Can I use a different cut-off factor alpha for rejecting the null hypothesis?*If you want to be more conservative, you can use a less selective cut-off factor, such as 0.01 or 0.001. Using anything smaller than 0.001 is pretty silly, because most of the tests don't accurately estimate the probability that the null hypothesis is true when p<0.001. In fact, many journals such as Nature will not let scientists report P values less than 0.001 (the authors must report "P<0.001").*If most papers use an alpha of 0.05, which is 1/20, does this mean that approximately 1/20 studies that report a significant result are likely to be reporting differences that actually occurred by chance?*Yes, many papers are likely to be reporting differences that only arise by chance. (The number of erroneous conclusions is probably less than 1/20, since many experiments report P values that are much smaller than 1/20, but by the nature of the process there must be some erroneous conclusions.)

We'll play around with this more in homework.

## Function reference

### Matlab functions and operators

**while**(loop)

### User functions

**cumulative_hist_diff**(you wrote as part of Lab 1.4)**ks2_cdf**

**Copyright 2011-2012 Stephen D. Van Hooser, all rights reserved.**