Category Archives: Simulation

Explaining and simulating an F distribution

I remember feeling very uncomfortable with the F distribution when I started learning statistics. It comes up when you’re learning ANOVA, which is rather complicated if you’re learning it for the first time. You learn about partitioning the sum of squares, determining degrees of freedom, calculating mean square error and mean square between-groups, putting everything in a table and finally calculating something called an F statistic. Oh, and don’t forget the assumptions of constant variance and normal distributions (although ANOVA is pretty robust to those violations). There’s a lot to take in! And that’s before you calculate a p-value. And how do you do that? You use an F distribution, which when I learned it meant I had to flip to a massive table in the back of my stats book. Where did that table come from and why do you use it for ANOVA? I didn’t know, I didn’t care. It was near the end of the semester and I was on overload.

I suspect this happens to a lot of people. You work hard to understand the normal distribution, hypothesis testing, confidence intervals, etc. throughout the semester. Then you get to ANOVA sometime in late November (or April) and it just deflates you. You will yourself to understand the rules of doing it, but any sort of intuition for it escapes you. You just want to finish the class.

This post attempts to provide an intuitive explanation of what an F distribution is with respect to one-factor ANOVA. Maybe someone will read this and benefit from it.

ANOVA stands for ANalysis Of VAriance. We’re analyzing variances to determine if there is a difference in means between more than 2 groups. In one-factor ANOVA we’re using one factor to determine group membership. Maybe we’re studying a new drug and create three groups: group A on 10 mg, group B on 5 mg, and group C on placebo. At the end of the study we measure some critical lab value. We take the mean of that lab value for the three groups. We wish to know if there is a difference in the means between those three population groups (assuming each group contains a random sample from three very large populations).

The basic idea is to estimate two variances and form a ratio of those variances. If the variances are about the same, the ratio will be close to 1 and we have no evidence that the population means differ based on our sample.

One variance is simply the mean of the group variances. Say you have three groups. Take the variance of each group and then find the average of those variances. That’s called the mean square error (MSE). In symbol-free mathematical speak, for three groups we have:

MSE = (variance(group A) + variance(group B) + variance(group C)) / 3

The other variance is the variance of the sample means multiplied by the number of items in each group (assuming equal sample sizes in each group).  That’s called mean square between groups (MSB). It looks something like this:

MSB = variance(group means)*(n)

The F statistic is the ratio of these two variances: F = MSB/MSE

Now if the groups have the same means and same variance and are normally distributed, the F statistic has an F distribution. In other words, if we were to run our experiment hundreds and hundreds of times on three groups with the same mean and variance from a normal distribution, calculate the F statistic each time, and then make a histogram of all our F statistics, our histogram would have a shape that can be modeled with an F distribution.

That’s something we can do in R with the following code:

# generate 4000 F statistics for 5 groups with equal means and variances
Fstat <- numeric(4000)
for (i in 1:4000){
    # draw 20 observations per group from N(10, 4)
    g1 <- rnorm(20, 10, 4)
    g2 <- rnorm(20, 10, 4)
    g3 <- rnorm(20, 10, 4)
    g4 <- rnorm(20, 10, 4)
    g5 <- rnorm(20, 10, 4)
    # MSE: the mean of the five group variances
    mse <- (var(g1) + var(g2) + var(g3) + var(g4) + var(g5))/5
    # grand mean of the five group means
    M <- (mean(g1) + mean(g2) + mean(g3) + mean(g4) + mean(g5))/5
    # MSB: the variance of the group means times the group size (n = 20)
    msb <- (((mean(g1)-M)^2 + (mean(g2)-M)^2 + (mean(g3)-M)^2 +
             (mean(g4)-M)^2 + (mean(g5)-M)^2)/4)*20
    Fstat[i] <- msb/mse
}
# plot a histogram of F statistics and superimpose F distribution (4, 95)
h <- hist(Fstat, plot=FALSE)
ylim <- range(0, h$density, 0.8)  # make sure the density curve fits
hist(Fstat, freq=FALSE, ylim=ylim)
curve(df(x, 4, 95), add=TRUE)     # theoretical F(4, 95) density

So I have 5 normally distributed groups each with a mean of 10 and standard deviation of 4. I take 20 random samples from each and calculate the MSE and MSB as I outlined above. I then calculate the F statistic. And I repeat 4000 times. Finally I plot a histogram of my F statistics and superimpose a theoretical F distribution on top of it. I get the resulting figure:

This is the distribution of the F statistic when the group samples come from Normal populations with identical means and variances. The smooth curve is an F distribution with 4 and 95 degrees of freedom. The 4 is the number of groups minus 1 (5 - 1). The 95 is the total number of observations minus the number of groups (100 - 5). If we examine the figure we see that we most likely get an F statistic around 1. The bulk of the area under the curve is between 0.5 and 1.5. This is the distribution we would use to find a p-value for any experiment that involved 5 groups of 20 members each. When the population means of the groups are different, we get an F statistic greater than 1. The bigger the differences, the larger the F statistic. The larger the F statistic, the more confident we are that the population means between the 5 groups are not the same.
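As a quick illustration of how the p-value works, suppose a hypothetical experiment with these 5 groups of 20 produced an F statistic of 2.5 (a made-up value). The p-value is the area under the F(4, 95) curve to the right of 2.5, which we can get with R's pf function:

1 - pf(2.5, 4, 95)  # upper-tail area beyond the observed F statistic
# equivalently: pf(2.5, 4, 95, lower.tail=FALSE)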

There is much, much more that can be said about ANOVA. I'm not even scratching the surface. But what I wanted to show here is a visual of the distribution of the F statistic. Hopefully it'll give you a better idea of what exactly an F distribution is when you're learning ANOVA for the first time.

Stabilization simulation of relative frequency

One of my favorite simulations is seeing how the relative frequency of an event stabilizes around a theoretical probability as the number of trials increases. For example, we say the probability of rolling a 5 when rolling a single die is \frac{1}{6}. Although it’s an obvious probability, it’s still fun to simulate 10,000 rolls of a die and see that indeed the relative frequency of the occurrence of 5’s stabilizes around \frac{1}{6}. When this concept is presented in a textbook it is often accompanied by a squiggly line graph that settles down around a straight line that represents the theoretical probability. That’s what I want to show how to do in this post, except using a dice example that does not have an obvious probability.

Let’s say we roll a single fair Las Vegas die six times and define success as follows:

  • If the number on the die matches the number of the roll at least once, we have success. (So if it’s our second roll and we roll a 2, we have success.)
  • If we go all six rolls with no result matching the order of the roll, we have failure.

While each roll of the die in 6 tries has a probability of \frac{1}{6} of matching the attempt number, there is no intuition for determining the probability of success as we defined it above. Theoretically we can calculate it easily enough. We note the probability of NOT rolling the same die value as attempt number is \frac{5}{6}. Furthermore, the probability of NOT rolling the same die value as attempt number for all six tries is (\frac{5}{6})^{6}. Now if we subtract (\frac{5}{6})^{6} from 1, we have the probability of rolling the same die value as attempt number at least once. So our probability is 1-(\frac{5}{6})^{6}=0.665102. That’s pretty high. Even though success on a single roll has probability of \frac{1}{6}, the probability we get at least one success in 6 rolls is about \frac{2}{3}. The idea that we have a better than even chance of seeing a success in 6 rolls may be intuitive, but the theoretical probability of 0.665102 is certainly not intuitive.
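This arithmetic is easy to verify in R:

1 - (5/6)^6
# [1] 0.665102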

We can simulate this experiment many times using a computer and check whether or not the relative frequency of success converges to the theoretical probability we calculated. We can then make one of those cool line graphs and see if the relative frequency settles down around 0.665. Here’s how I did it in R:

p <- rep(NA, 1000)   # running relative frequency of success
s <- rep(NA, 1000)   # success (1) or failure (0) in each trial
for (j in 1:1000){
    ex <- rbinom(1, 6, 1/6)         # run the experiment: number of matches in 6 rolls
    s[j] <- ifelse(ex >= 1, 1, 0)   # success if at least one match
    p[j] <- sum(s, na.rm = TRUE)/j  # relative frequency through trial j
}

The first two lines create empty vectors to store the relative frequency (p) and whether or not we saw success in each trial (s). I then execute a for loop with 1000 iterations. Each time through the loop I run the experiment using the rbinom function, I check to see if I have success using the ifelse function (storing 1 if "yes" and 0 otherwise), and then calculate the relative frequency (p). Notice that the relative frequency uses information from the previous trials in each iteration. For example, I could see a sequence such as...

  1. roll six times, get a success: p = 1/1
  2. roll six times, get a success: p= 2/2
  3. roll six times, no success: p = 2/3
  4. roll six times, get a success: p = 3/4
  5. roll six times, get a success: p = 4/5

Each time the denominator increases and matches the trial number. The numerator is always the previous numerator plus 1 if there was a success, or plus 0 if no success.
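As an aside, the whole running calculation can also be done without a loop. Here's a vectorized sketch of the same idea (the names ex2, s2 and p2 are just to avoid clobbering the loop's vectors):

ex2 <- rbinom(1000, 6, 1/6)   # number of matches in each of 1000 experiments
s2 <- ifelse(ex2 >= 1, 1, 0)  # success or failure for each experiment
p2 <- cumsum(s2)/(1:1000)     # running relative frequency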

Once you run the simulation, you can graph it with the following:

n <- 1:1000
plot(n,p,type="l",main="convergence to probability as n increases")
lines(c(1,1000),c(0.665, 0.665))

The first line creates my x-axis values. The second line makes the plot. The third line adds a horizontal line to the graph to represent the theoretical probability.
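As another aside, the abline function gives a more concise way to draw that horizontal reference line:

abline(h = 0.665)  # horizontal line at the theoretical probability

Here's how my graph turned out: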


Simulating outcomes of a simple experiment

One of the first things you learn in an introductory mathematical statistics class is how to assign a probability model to the outcomes of a simple experiment. For example, say we have an urn containing the following:

  • 2 blue marbles
  • 3 green marbles
  • 3 red marbles
  • 2 yellow marbles

The experiment is to select one marble at random and note the color. Our intuition tells us to assign probabilities as follows:

  • P(blue) = \frac{2}{10}
  • P(green) = \frac{3}{10}
  • P(red) = \frac{3}{10}
  • P(yellow) = \frac{2}{10}

If we code our colors as 1=blue, 2=green, 3=red and 4=yellow, then our probability model is given by the probability mass function (p.m.f.):

f(1)=\frac{2}{10}, f(2)=\frac{3}{10}, f(3)=\frac{3}{10}, f(4)=\frac{2}{10}

We can construct a probability histogram of this p.m.f. using R as follows:

# one value per marble: 2 ones, 3 twos, 3 threes, 2 fours
ht <- rep(c(1,2,3,4), c(2,3,3,2))
hist(ht, freq=FALSE, breaks=c(0.5,1.5,2.5,3.5,4.5))

This gives us the nice histogram:

If we were to actually carry out this experiment a number of times and make a histogram, we would expect to see a similar shape. This can be easily simulated on a computer. Unfortunately many textbooks don't actually tell you how to do this. Probably because doing so would mean documenting some method using some program that will likely change or be obsolete at some point. But here I show you how it can be done in R. And I can't imagine this method will go out of date any time soon.

The easy part is the simulation. One call to the rmultinom function does the trick:

rmultinom(1, 500, c(2,3,3,2))

The 1 means I want one result (one column of outcomes). The 500 means the result is based on 500 draws from the four categories. The c(2,3,3,2) is a vector representing the number of colored marbles I have in each category; R normalizes it to probabilities behind the scenes, so I also could have used the probabilities directly: c(0.2, 0.3, 0.3, 0.2).

When I call this function I need to remember the results, so I give it a name, "h":

h <- rmultinom(1, 500, c(2,3,3,2))

"h" stores a matrix of numbers:

h
     [,1]
[1,]  100
[2,]  159
[3,]  141
[4,]  100

These are the results of the 500 experiments. I picked 100 blues, 159 greens, 141 reds and 100 yellows. Now to make the histogram. What I choose to do is create a vector of 500 items using the codes I assigned above (1=blue, 2=green, 3=red, and 4=yellow). I can do this using the rep() function as follows:

hx <- rep(c(1,2,3,4),c(h[1,1],h[2,1],h[3,1],h[4,1]))

The first part - c(1,2,3,4) - says I want to repeat the numbers 1, 2, 3, and 4. The second part - c(h[1,1],h[2,1],h[3,1],h[4,1]) - indicates how many times to repeat each number. h[1,1] references row 1, column 1 of the matrix called "h"; h[2,1] references row 2, column 1 of the matrix; and so forth.
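Incidentally, since those four counts make up the entire first column of "h", the same vector can be built a little more compactly; this sketch does the same thing:

hx <- rep(c(1,2,3,4), h[,1])  # h[,1] is the whole first column of counts

Now I can use the hist function to create our histogram: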

hist(hx, freq=FALSE, breaks=c(0.5,1.5,2.5,3.5,4.5))

Here's the result:

As expected, it looks pretty close to the theoretical histogram we created first. Again, nothing mind-blowing in this post. But I did want to document one way to simulate these simple experiments we so often see early in our statistical studies.

Explaining and simulating a chi-square distribution

Randomly select several values, say 10, from a standard Normal distribution. That’s a Normal distribution with a mean of 0 and standard deviation of 1. Now square each value and sum them all up. That sum is an observation from a chi-square distribution with 10 degrees of freedom. If you picked 12 values and did the same, you would have an observation from a chi-square distribution with 12 degrees of freedom.

We can do this in R as follows:

# random sample of 10 values from N(0,1) summed up
s <- rnorm(10)
sum(s^2)

When I did it, I got the following:

s <- rnorm(10)
s
[1] 2.3235602  0.1339985  0.5645995  1.3603184 -1.1383273  0.1735691
[7] 0.9075259  0.2842374 -0.4214184 -0.1069466
s^2
[1] 5.39893181 0.01795559 0.31877258 1.85046616 1.29578902 0.03012623
[7] 0.82360328 0.08079092 0.17759345 0.01143758
sum(s^2)
[1] 10.00547

Now let's repeat this 1000 times and make a histogram:

x <- rep(NA, 1000)
for (j in 1:1000){
    x[j] <- sum(rnorm(10)^2)  # one chi-square observation with 10 df
}
hist(x)
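As a side note, the same simulation can be written more compactly with R's replicate function; this one-liner is equivalent to the loop above:

x <- replicate(1000, sum(rnorm(10)^2))  # 1000 sums of 10 squared N(0,1) draws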

Obviously we don't see the Normal-looking bell shape. The distribution skews right. Also notice that since we squared all our values, every sum is positive, which means the histogram starts at 0. Now let's plot a chi-square curve with 10 degrees of freedom on top of our histogram:

h <- hist(x, plot=F)
ylim  <- range(0, h$density, 0.10)
hist(x, freq=F, ylim=ylim)
curve(dchisq(x,df=10), add = TRUE)

I had to experiment a little to get the 0.10 value for the range statement. That was necessary to keep R from making the displayed range of the y-axis too small.

The formula for the chi-square curve (i.e., the probability density function) is

f(x)=\frac{1}{\Gamma(r/2)2^{r/2}}x^{r/2-1}e^{-x/2}

where r is degrees of freedom and x must be positive. And, yes, I think it looks pretty frightening if you're not used to that sort of thing.
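If you want to convince yourself that this formula and R's dchisq function agree, you can evaluate both at the same point. Here's a quick sketch; the x value of 4 is just an arbitrary test point:

r <- 10
x <- 4
x^(r/2 - 1) * exp(-x/2) / (gamma(r/2) * 2^(r/2))  # the pdf formula by hand
dchisq(x, df = r)                                  # R's built-in density
# both lines return the same value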