Mosaic plots provide a way to visualize contingency tables. However they’re not as intuitive as, say, a scatter plot. It’s hard to mentally map a contingency table of raw counts to what you’re seeing in a mosiac plot. Let’s demonstrate.
Here I use data from Example 8.6-3 of Probability and Statistical Inference (Hogg & Tanis, 2006). These data are a random sample of 400 University of Iowa undergraduate students. The students were classified according to gender and the college in which they were enrolled.
M <- matrix(data = c(21,14,16,4,145,175,2,13,6,4), ncol=5, dimnames = list(Gender=c("Male","Female"), College= c("Business","Engineering","Liberal Arts","Nursing","Pharmacy"))) M
## College ## Gender Business Engineering Liberal Arts Nursing Pharmacy ## Male 21 16 145 2 6 ## Female 14 4 175 13 4
To create a basic mosaic plot in R, you use the
mosaicplot function with a contingency table as the first argument. This can be a
matrix object. I’ve also added a title to the graph and used the
las argument to make the axis labels horizontal.
The main feature that likely jumps out to you is the lack of numbers. We just see rectangles stacked on one another. If we compare the mosaic plot to the table of counts, the size of the boxes seem related to the counts in the table. Indeed they are, but how? Another feature you may notice is that the widths of the rectangles do not vary but the heights do. What’s up with that?
By default, the
mosaicplot function recursively calculates marginal proportions starting with the rows. In our example, we start with gender:
apply(M, 1, function(x)sum(x)/sum(M))
## Male Female ## 0.475 0.525
That simply says 0.475 of our sample is male, and 0.525 is female. These are the widths of the rectangles.
Now within gender, calculate the proportion belonging to each college:
prop.table(M, margin = 1)
## College ## Gender Business Engineering Liberal Arts Nursing Pharmacy ## Male 0.11052632 0.08421053 0.7631579 0.01052632 0.03157895 ## Female 0.06666667 0.01904762 0.8333333 0.06190476 0.01904762
Among males about 11% are enrolled in the Business college versus only 6% among females. These are the heights of our rectangles.
The scale of the plot is basically 0 to 1 on the x and y axes. So the first rectangle we see in the Male column for the Business college is 0.475 wide and about 0.11 tall. In the Female column, the rectangle for the Business college is 0.525 wide and about 0.07 tall. Visually we see there are more Females than Males in our sample because the Female rectangles are wider. Within the gender columns, we see Males have a higher proportion in the Business school than do Females because their rectangle is taller.
That’s what mosaic plots attempt to visualize: recursive proportions of membership within a n-dimension table.
Let’s try it on a table with 3 dimensions. Below we’ll use a data set that comes with R called UCBAdmissions. This data set contains “aggregate data on applicants to graduate school at Berkeley for the six largest departments in 1973 classified by admission and sex.” This is a rather famous data set used for illustrating Simpson’s Paradox.
## , , Dept = A ## ## Gender ## Admit Male Female ## Admitted 512 89 ## Rejected 313 19 ## ## , , Dept = B ## ## Gender ## Admit Male Female ## Admitted 353 17 ## Rejected 207 8 ## ## , , Dept = C ## ## Gender ## Admit Male Female ## Admitted 120 202 ## Rejected 205 391 ## ## , , Dept = D ## ## Gender ## Admit Male Female ## Admitted 138 131 ## Rejected 279 244 ## ## , , Dept = E ## ## Gender ## Admit Male Female ## Admitted 53 94 ## Rejected 138 299 ## ## , , Dept = F ## ## Gender ## Admit Male Female ## Admitted 22 24 ## Rejected 351 317
mosaicplot(UCBAdmissions, main="Student Admissions at UC Berkeley")
How do we read this? Start with the Admit rows in our table of counts. That dictates the width of the two columns in the mosaic plot. Visually we see more people were rejected than admitted because the Rejected column of rectangles is wider. Next, go to the columns of the table: Gender. We see that of the people admitted, a much higher proportion were Male because of the height of the rectangles. Of the people rejected, it appears to be pretty even. Finally we move to the 3rd dimension: Dept. The height of these rectangles (or width, depending on how you look at it) is determined by proportion of Gender within Admit. So starting with the Admit column, compare the Dept rectangles between Male and Female. We see that a higher proportion of admitted Males were for Depts A and B compared to the proportion of admitted Females for the same Depts. On the other hand we see that a higher proportion of admitted Females were for Depts C – F compared to the proportion of admitted Males.
Were Depts A and B discriminating against Females? You might think so if you stop there. But look at the Rejected column. We see that of the rejected Males and Females, a much higher proportion of the Males were rejected for Depts A and B than Females. The widths of the Male rectangles are wider than their Female counterparts. Likewise for Depts C – F. It’s pretty clear that of the rejected Males and Females, a higher proportion of the Females were rejected for Depts C – F than Males. Again the widths of the Female rectangles are wider than their Male counterparts.
That’s where Simpson’s Paradox comes into play. If we disregard the within Dept counts, we see what appears to be Female discimination:
# collapse count over departments and create mosaic plot margin.table(UCBAdmissions, margin = c(1, 2))
## Gender ## Admit Male Female ## Admitted 1198 557 ## Rejected 1493 1278
mosaicplot(margin.table(UCBAdmissions, margin = c(1, 2)), main = "Student admissions at UC Berkeley")
To really understand what mosaic plots are showing, it helps to create one “by hand”. There’s no real point in doing so other than personal edification. But let’s be edified. We’ll work with our Univ of Iowa data.
We know our plot needs x and y axes with a scale of 0 to 1. We also know we need to draw rectangles. Fortunately R has a
rect function that allows you to create rectangles. You tell it the coordinate points for the bottom left and upper right corners of your rectangle and it does the rest.
In order to translate the width and height of rectangles to locations within the plot, we’ll need to use the
cumsum function. I need to draw rectangles relative to other rectangles. Hence the position of a rectangle corner will need to take into account other rectangles drawn above or beside it. The
cumsum function allows us to do that.
Here’s my rough stab at a manual mosaic plot:
# widths widths <- cumsum(c(0, apply(M, 1, function(x)sum(x)/sum(M)))) # heights pt <- prop.table(M,margin = 1) heightsM <- cumsum(c(0,pt[1,])) heightsF <- cumsum(c(0,pt[2,])) # Need to reverse the y axis plot(x=c(0,1), y=c(0,1), xlim=c(0,1), ylim=c(1,0), type = "n", xlab = "", ylab = "") # male rectangles rect(xleft = widths, ybottom = heightsM[-1], xright = widths, ytop = heightsM[-6], col=gray(seq(3,12,2) / 15)) # female rectangles rect(xleft = widths, ybottom = heightsF[-1], xright = widths, ytop = heightsF[-6], col=gray(seq(3,12,2) / 15))
If you compare that to the original
mosaicplot() output above that I drew at the beginning of this post you can see we’ve basically drawn the same thing without spacing around the rectangles. That’s why I used the
gray function to fill in the boxes with distinguishing shades. Again, nowhere near as nice as what the
mosaicplot give us, but a good way to understand what the
mosaicplot function is doing.