Monthly Archives: April 2016

Recreating a graph from a NCHS data brief

I was reading this sobering article about the rise of suicide rates in the United States and decided to read the NCHS data brief on which it was based. I noticed the source data was freely available online through the CDC Wonder databases, specifically the Detailed Mortality database, and figured I would take a look. One graph that caught my eye was Figure 2, which showed that suicide rates for females were higher in 2014 than in 1999 for all age groups under 75 years.

Figure 2

That's a very sad and concerning graph. What happened in the last 15 years? I have no theories about the increase in suicide rates, but I do admire the CDC statisticians who tackle this kind of gut-wrenching data, day in and day out. Let's join them for a moment and recreate this graph in R.

First we need to get the data from the Detailed Mortality database. Instead of directly downloading raw data, we submit a query for the data we want. There is a 7-step form to complete.

Step 1: Group by ICD-10 113 Cause List, Cause of death, Age Groups, Gender, and Year. This uses all available pull-down lists.

Step 2: leave as is. (ie, all US states)

Step 3: leave as is. (ie, all genders, ages, and races)

Step 4: select 1999 and 2014 using Ctrl + Click to select both years

Step 5: leave as is. (ie, all weekdays, autopsies, and places of death)

Step 6: select the ICD-10 113 Cause List radio button and select “Intentional self-harm (suicide) (*U03,X60-X84,Y87.0)”

Step 7: click Send.

The results should appear in your web browser. If everything looks good, click the Export button to download a tab-delimited file of the results. It will have a .txt extenstion. Now that we have our data, we can go to R and get to work.

Before we read in the data we should open it in a text editor and see what we got. We have a header row, so that's good. If we scroll down, we'll see there's about 70 lines of metadata. This is data about our data, such as where it came from, when we downloaded it, etc. We don't want that in our data frame, but we should save it with our data frame as an attribute.

To start we'll use readLines to read in the data one line at a time. I chose to do this as a way to programmatically figure out where the metadata begins. I noticed it begins after a divider of three dashes (“—”), so I use grep in combination with min to find the first occurence of this divider and save a n.

raw <- readLines("Underlying Cause of Death, 1999-2014 (3).txt")
## Warning in readLines("Underlying Cause of Death, 1999-2014 (3).txt"):
## incomplete final line found on 'Underlying Cause of Death, 1999-2014
## (3).txt'
n <- min(grep(pattern = "---",raw))

Now I can use n with the nrows argument in the read.delim function to tell it how many rows I want to read in. Notice I have to say n - 2. I have to subtract the header row and the divider row itself. I also specify “Not Applicable” and “Unreliable” as NA values. Finally, take a look at my file name. Yep, it took me a few tries to get the data I wanted. I'm not too proud to admit that. :)

dat <- read.delim(file = "Underlying Cause of Death, 1999-2014 (3).txt", header = TRUE, nrows = n-2,
                  na.strings = c("Not Applicable","Unreliable"))

Let's go ahead and save the metadata with our data frame. We can do that with the comment function. We don't need it to create the graph, but it's good practice to preserve this kind of information. To see the metadata after we save, just enter comment(dat). It's not pretty when it's printed to the console, but we have it if we need it.

comment(dat) <- raw[n:length(raw)]

Now we need to clean up the data. First order of business is to reorder the age groups. By default, R brought them in as a factor and put the 15-24 age group as the first level. We want that first level to be the 5-14 age group. It's already a factor, but we reorder the levels by making the column a factor again and specifying the levels in the specified order. The original 5th level comes first and then everything else. There may be a better way of doing this, but it works.

dat$Ten.Year.Age.Groups.Code <- factor(dat$Ten.Year.Age.Groups.Code, 
                                       levels = levels(dat$Ten.Year.Age.Groups.Code)[c(5,1,2,3,4,6,7,8,9,10)])

We only need females, so let's subset. I make a new data frame called dat2.

dat2 <- subset(dat, Gender=="Female")

Now if we look at Figure 2 again, the x-axis has 6 age groups, but our data has 9 age groups. So we need to combine a few age groups. Specifically we need to combine the “25-34” and “35-44” groups into “25-44”, combine the “45-54” and “55-64” groups into “45-64”, and the “75-84” and “85+” into “75+”. That's not too bad. We just reset the levels.

levels(dat2$Ten.Year.Age.Groups.Code) <- c("5-14", "15-24", "25-44", "25-44", "45-64", "45-64",
                                           "65-74", "75+", "75+", "NS")

Since we have 6 age groups instead of 9, we need to combine death totals for the three new age groups. We also need to combine death totals for all the various causes of deaths. We can do that with aggregate. Below we sum Deaths by Ten.Year.Age.Groups.Code and Year. This returns a new data frame that I call dat3.

(dat3 <- aggregate(Deaths ~ Ten.Year.Age.Groups.Code + Year, data=dat2, sum))
##    Ten.Year.Age.Groups.Code Year Deaths
## 1                      5-14 1999     50
## 2                     15-24 1999    575
## 3                     25-44 1999   2359
## 4                     45-64 1999   1868
## 5                     65-74 1999    420
## 6                       75+ 1999    469
## 7                      5-14 2014    151
## 8                     15-24 2014    990
## 9                     25-44 2014   3018
## 10                    45-64 2014   4195
## 11                    65-74 2014    828
## 12                      75+ 2014    477
## 13                       NS 2014      1

Next I need to calculate “deaths per 100,000” for each age group. We do this by dividing Deaths by the age group population and multiplying by 100,000. I need to add the age group populations to dat3 to do this. First I need to sum the populations for my new age groups. I can use aggregate again with dat2, but I need to remove duplicate rows of age group population values before I do this. I decided to save the de-duped data frame to tmp and not overwrite dat2

tmp <- subset(dat2, select = c("Year", "Ten.Year.Age.Groups.Code", "Population"))
tmp <- tmp[!duplicated(tmp),]
(dat4 <- aggregate(Population ~ Ten.Year.Age.Groups.Code + Year, data=tmp, sum))
##    Ten.Year.Age.Groups.Code Year Population
## 1                      5-14 1999   19908282
## 2                     15-24 1999   18860184
## 3                     25-44 1999   42614694
## 4                     45-64 1999   31011068
## 5                     65-74 1999   10125424
## 6                       75+ 1999   10371906
## 7                      5-14 2014   20161424
## 8                     15-24 2014   21456371
## 9                     25-44 2014   41900194
## 10                    45-64 2014   42789506
## 11                    65-74 2014   14049245
## 12                      75+ 2014   11842674

Now I'm ready to merge dat3 and dat4 so I have Deaths and Populations in one data frame. Before I do that I remove the row with age group of “NS” which means “Not Stated”. We don't need that row.

dat3 <- subset(dat3, Ten.Year.Age.Groups.Code != "NS")
dat5 <- merge(dat3,dat4,by = c("Year","Ten.Year.Age.Groups.Code"))

And finally we can calculate Rate. I decided to also sort the data frame as well.

dat5$Rate <- round(dat5$Deaths/dat5$Population * 100000,1)
(dat5 <- dat5[order(dat5$Ten.Year.Age.Groups.Code, dat5$Year),])
##    Year Ten.Year.Age.Groups.Code Deaths Population Rate
## 4  1999                     5-14     50   19908282  0.3
## 10 2014                     5-14    151   20161424  0.7
## 1  1999                    15-24    575   18860184  3.0
## 7  2014                    15-24    990   21456371  4.6
## 2  1999                    25-44   2359   42614694  5.5
## 8  2014                    25-44   3018   41900194  7.2
## 3  1999                    45-64   1868   31011068  6.0
## 9  2014                    45-64   4195   42789506  9.8
## 5  1999                    65-74    420   10125424  4.1
## 11 2014                    65-74    828   14049245  5.9
## 6  1999                      75+    469   10371906  4.5
## 12 2014                      75+    477   11842674  4.0

A quick comparison of our rates with those in Figure 2 shows that we're in agreement. Now it's time to re-create the graph. For this I chose to use ggplot2. To match the colors, I uploaded a picture of the original graph to this wonderful site and figured out the shades of green were #6D9F40 (dark green) and #CFDABA (light green).

library(ggplot2)
ggplot(dat5, aes(x=Ten.Year.Age.Groups.Code, y=Rate, fill=factor(Year))) +
  geom_bar(stat="identity", position = "dodge") +
  labs(x="Age (years)",y="Deaths per 100,000 in specified group") +
  scale_fill_manual("Year", values = c("#6D9F40","#CFDABA")) +
  annotate(geom = "text", x = (rep(1:6,each=2) + c(-1,1)*0.22), 
           y = dat5$Rate + 0.2, 
           label = sprintf("%.1f", dat5$Rate)) +
  theme_bw() +
  scale_y_continuous(breaks=seq(0,10,2))

plot of chunk unnamed-chunk-11

The first line defines the aesthetics. I want Year to be treated like a categorical variable, so I call it with factor. The next lines says, “don't count the rows to create the bar graph, use the actual values of Rate (ie, identity).” It also says “dodge”“ the Years so they're next to each other (instead of stacked). The third line labels the axes. The fourth line titles the legend and defines the colors. The annontate line adds the counts to the plot. That took some trial and error! The x and y values are the coordinates. The label is the Rate values rendered as character with one decimal place. The theme_bw() line is an easy way to get rid of the default grey background and the last line relabels the y axis.