For reasons I no longer remember I decided a while back to reproduce Figure 1.5.2 from Casella and Berger (page 32):
It's a plot of the cumulative distribution function of a geometric distribution with p = 0.3. The geometric distribution is best explained with coin-flipping. Let's say we keep flipping a coin until we observe a head. Now let X be a random variable that equals the number of tosses required to see a head. So if we get a head on the first flip of the coin, that means we needed one toss. If it takes two tosses, we got a tail on the first toss and then a head on the second toss. The graph above visualizes the possibilities for 15 flips with probability of getting heads set to 0.3. We see that the probability of taking only one toss is 0.3, which makes sense since the probability of getting heads is 0.3. The probability of requiring two tosses is about 0.5. And so on. The straight lines indicate the probabilities only change at the whole numbers. There is no such thing as, say, 2.23 flips. This is often referred to as a step function. We see that there's a high probability of getting a head by the 5th or 6th flip. This is a simple waiting-time distribution that can be used to model the number of successes before a failure and vice-versa.
To recreate this plot in R I used the pgeom function, which returns the cumulative probability of the geometric distribution for a given number of failures before a success occurs and a given probability of success. For example, the probability of 0 failures before observing a success with p = 0.3:
pgeom(0, prob = 0.3)
##  0.3
The probability of 1 failure before observing a success with p = 0.3:
pgeom(1, prob = 0.3)
##  0.51
To generate all the probabilities in the plot, we can do the following:
y <- pgeom(q = 0:14, prob = 0.3)
Notice that instead of using
1:15 (number of total flips for success), we use
0:14. That's because pgeom works with the number of failures before success. Now that we have our y coordinates, we can create our first version of the plot as follows:
plot(x = 1:15, y, pch = 19)
Now let's add the horizontal lines using the
plot(x = 1:15, y, pch = 19) segments(x0 = 0:15, y0 = c(0, y), x1 = 1:16, y1 = c(0, y), lwd = 2)
y0 coordinates are where the line starts. The
y1 coordinates are where the line ends. Since the lines are horizontal, the y coordinates are the same for the start and end postions. The y coordinates include 0, so we add that value to
y with the
c function. The
lwd = 2 argument makes the line a little thicker and darker.
Our plot is basically done, but just for fun I wanted to see how close I could make it look like the version in the book. That means relabeling the axes, moving the axis labels to the ends, and removing the lines at the top and right side of the plot. It also means moving the axis tick marks inside the plotting area. After some trial and error and deep dives into R documentation, here's what I was able to come up with:
plot(x = 1:15, y, pch = 19, yaxt = "n", xaxt = "n", ylim = c(0,1.05), xlim = c(0,15.5), bty="l", xaxs="i", yaxs = "i", xlab = "", ylab = "") segments(x0 = 0:15, y0 = c(0, y), x1 = 1:16, y1 = c(0, y), lwd = 2) axis(side = 1, at = 0:15, labels = 0:15, tcl = 0.5, family = "serif") axis(side = 2, at = seq(0,1,0.1), labels = c(0,paste(".",1:9),1), las=1, tcl = 0.5, family = "serif") mtext(text = expression(italic(x)), side = 4, at = 0, las = 1, line = 0.5, family = "serif") mtext(text = expression(italic(F[x](x))), side = 3, at = 0, line = 0.5, family = "serif")
yaxt = "n"and
xaxt = "n"arguments say “don't label the axes”. I instead use the
axisfunction to create the axes.
ylim = c(0,1.05)and
xlim = c(0,15.5)arguments tell the axes to end at 1.05 and 15.5, respectively. I wanted them to extend beyond the last value just as they do in the book.
bty="l"argument says “draw a box around the plot like a capital letter L”
yaxs = "i"arguments ensures the axes fit within the original range of the data. The default is to extend the range by 4 percent at each end. Again, I was trying to recreate the graph in the book. Notice how the origin has the x-axis and y-axis 0 values right next to one another.
xlab = ""and
ylab = ""set blank axis labels. I instead use the
mtextfunction to add axis labels.
segments function remained unchanged.
axis function allows us to explicitly define how the axis is created.
sideargument specifies which side of the plot we're placing the axis.
1is the bottom,
2is the left.
atis where we draw the tick marks.
labelsare how we label the tick marks.
tclargument specifies how long to make the tick marks and in what direction. A positive value extends the tick marks into the plotting region.
lasargument in the second
axisfunction makes the labels on the y-axis horizontal.
Finally I used the
mtext function to create the axis labels.
mtext writes text into the margins of a plot and can take some trial and error to get the placement of text just so.
textargument is what text we want to place in the graph. In this case I make use of the
expressionfunction which allows us to create mathematical notation. For example, the syntax
expression(italic(F[x](x)))returns \(F_x (x)\)
sideargument again refers to where in the plot to place the text.
3is top and
4is right. This means the y-axis label is actually in the top of the plot and the x-axis label is on the right. A little bit of a hack.
atsays where to place the text along the axis parallel to the margin. In both cases we use 0. We want the y-axis label at the 0 point corresponding to the x-axis, and the x-axis label at the 0 point corresponding to the y-axis. A little confusing, I think.
lasargument rotates the x label to be horizontal
lineargument specifies on which margin line to place the text, starting at 0 counting outwards. This is one that takes some trial and error to get just right.
familyargument specifies the type of font to use. “serif” is like Times New Roman.
Not perfect, but close enough. Of course I much prefer the R defaults when it comes to plotting layout. Even though R allows us to recreate this graph, I don't think it's necessarily a “good” graph.
I also decided to tackle this using ggplot. Here's how far I got.
library(ggplot2) dat <- data.frame(x = 1:15, y = pgeom(q = 0:14, prob = 0.3)) dat2 <- data.frame(x = 0:15, y = c(0, dat$y[-16]), xend = 1:16, yend = c(0,dat$y[-16])) ggplot(dat, aes(x = x, y = y)) + geom_point() + geom_segment(aes(x = x, y = y, xend = xend, yend = yend), data = dat2) + scale_x_continuous(breaks = 0:15, labels = 0:15) + scale_y_continuous(breaks = seq(0,1,0.1), labels = c(0,paste(".",1:9),1)) + labs(x = "x", y = expression(italic(F[x](x)))) + theme(panel.background = element_blank(), axis.line.x = element_line(color = "black"), axis.line.y = element_line(color = "black"), axis.title.x = element_text(face = "italic", hjust = 1), axis.title.y = element_text(face = "italic", hjust = 1, angle = 0))
You can see I couldn't figure out how to move the axis ticks into the plotting region, or how to place axis labels at the ends of the axis, or how to get the origin to start at precisely (0, 0). I'm not saying it can't be done, just that I lost interest in trying to go further. And a very strong argument can be made that these are things you shouldn't do anyway! But as I said at the outset, this was all for fun.