The Secret Weapon: section 10.9 of Regression and Other Stories

In section 10.9 of Regression and Other Stories, the authors introduce the idea of fitting the same model to many data sets. They begin the section by saying, “it is common to fit a regression model repeatedly…to subsets of an existing data set.” But then two paragraphs later call this idea a “secret weapon” because “it is so easy and powerful but yet is rarely used.” So which is it? Common or rarely used?

Anyway, here’s the code they use to demonstrate this idea using NES data. This creates Figure 10.9 on page 149. While I enjoyed working through this code and figuring out how it works, I decided to re-implement it using data frames and ggplot2.

First I load the {rstanarm} package and read in the data.

library(rstanarm)
data <- read.table("https://github.com/avehtari/ROS-Examples/raw/refs/heads/master/NES/data/nes.txt")

Next I modify their custom function as follows. The changes are in the coefs object that is created. Instead of a vector, I return a data frame. I also add the variable name and year to the data frame.

coef_names <- c("Intercept", "Ideology", "Black", "Age_30_44", 
                "Age_45_64", "Age_65_up", "Education", "Female", "Income")
regress_year <- function (yr) {
  this_year <- data[data$year==yr,]
  fit <- stan_glm(partyid7 ~ real_ideo + race_adj + factor(age_discrete) +
                    educ1 + female + income,
                  data=this_year, warmup = 500, iter = 1500, refresh = 0,
                  save_warmup = FALSE, cores = 1, open_progress = FALSE)
    coefs <- data.frame(var = coef_names, 
                      coef = coef(fit), 
                      se = se(fit), 
                      year = yr)
}

Now I run the function using lapply(), combine the list of data frames into one data frame, add upper and lower limits of a 50% confidence interval, and set the variable name as a factor.

sum2 <- lapply(seq(1972,2000,4), regress_year)
sumd <- do.call(rbind, sum2)
sumd$upper <- sumd$coef + sumd$se*0.67
sumd$lower <- sumd$coef - sumd$se*0.67
sumd$var <- factor(sumd$var, levels = coef_names)

And finally I create the plot using {ggplot2}.

library(ggplot2)
ggplot(sumd) +
  aes(x = year, y = coef) +
  geom_point() +
  geom_errorbar(mapping = aes(ymin = lower, ymax = upper), width = 0) +
  geom_hline(yintercept = 0, linetype = 2) +
  facet_wrap(~ var, scales = "free") +
  theme_classic()

According to the book, “ideology and ethnicity are the most important, and their coefficients have been changing over time.”

The above approach uses Bayesian modeling via the {rstanarm} package. We can also deploy the secret weapon using a frequentist approach via the lmList() function from the {lme4} package. Below I first subset the data to select only the columns I need for the years 1972 - 2000. This is necessary due to the amount of missingness in the data.

library(lme4)
vars <- c("partyid7", "real_ideo" , "race_adj" , "age_discrete" ,
  "educ1" , "female" , "income", "year")
d <- data[data$year %in% seq(1972,2000,4),vars]
fm1 <- lmList(partyid7 ~ real_ideo + race_adj + factor(age_discrete) +
                educ1 + female + income | year, 
              data = d)

Calling summary on the object lists a summary of coefficients over time.

summary(fm1)

Call:
  Model: partyid7 ~ real_ideo + race_adj + factor(age_discrete) + educ1 + female + income | NULL 
   Data: d 

Coefficients:
   (Intercept) 
        Estimate Std. Error    t value     Pr(>|t|)
1972  1.76583047  0.3713743  4.7548538 2.019151e-06
1976  1.11162271  0.4038796  2.7523615 5.929490e-03
1980  1.70511202  0.5101462  3.3423987 8.342261e-04
1984  2.27900236  0.3731396  6.1076403 1.056356e-09
1988  3.04892311  0.4003568  7.6155146 2.913307e-14
1992  1.44675559  0.3479402  4.1580579 3.241743e-05
1996 -0.06088684  0.4523491 -0.1346014 8.929303e-01
2000  0.71444233  0.6875610  1.0390967 2.987898e-01
   real_ideo 
      Estimate Std. Error  t value      Pr(>|t|)
1972 0.4846373 0.04058343 11.94175  1.320025e-32
1976 0.5874170 0.04129972 14.22327  2.220613e-45
1980 0.6039183 0.04995714 12.08873  2.298924e-33
1984 0.6262146 0.03860522 16.22098  2.771748e-58
1988 0.6219214 0.03971156 15.66097  1.659414e-54
1992 0.7075364 0.03500057 20.21499  9.375655e-89
1996 0.9364460 0.04067273 23.02393 8.874569e-114
2000 0.7892252 0.06129477 12.87590  1.399443e-37
   race_adj 
      Estimate Std. Error    t value     Pr(>|t|)
1972 -1.105586  0.1870994  -5.909083 3.575067e-09
1976 -1.097028  0.1980980  -5.537805 3.155805e-08
1980 -1.284468  0.2429898  -5.286098 1.281122e-07
1984 -1.483666  0.1812912  -8.183885 3.153604e-16
1988 -1.732068  0.1721402 -10.061961 1.109738e-23
1992 -1.346218  0.1564942  -8.602349 9.232858e-18
1996 -1.220354  0.1846709  -6.608263 4.126844e-11
2000 -1.079103  0.2948775  -3.659496 2.542700e-04
   factor(age_discrete)2 
        Estimate Std. Error    t value   Pr(>|t|)
1972 -0.18895861  0.1373869 -1.3753757 0.16905188
1976 -0.03744826  0.1486541 -0.2519154 0.80111262
1980 -0.14625870  0.1939140 -0.7542451 0.45072333
1984 -0.23055707  0.1410016 -1.6351374 0.10205793
1988 -0.30995572  0.1512620 -2.0491320 0.04048039
1992 -0.21210428  0.1482977 -1.4302602 0.15267977
1996 -0.02979256  0.1829521 -0.1628435 0.87064561
2000 -0.45006072  0.2959671 -1.5206443 0.12838698
   factor(age_discrete)3 
        Estimate Std. Error    t value     Pr(>|t|)
1972 -0.04740623  0.1347712 -0.3517535 7.250320e-01
1976 -0.05728971  0.1461493 -0.3919945 6.950723e-01
1980 -0.38373449  0.1947510 -1.9703849 4.882726e-02
1984 -0.66715561  0.1531480 -4.3562810 1.338718e-05
1988 -0.45235822  0.1629587 -2.7759065 5.517059e-03
1992 -0.50779145  0.1591265 -3.1911175 1.422483e-03
1996 -0.27181837  0.1907075 -1.4253153 1.541034e-01
2000 -0.71662580  0.2995257 -2.3925355 1.675438e-02
   factor(age_discrete)4 
       Estimate Std. Error    t value    Pr(>|t|)
1972  0.5125853  0.1772718  2.8915225 0.003843694
1976  0.4486697  0.1889933  2.3739982 0.017619118
1980  0.0238989  0.2292029  0.1042696 0.916957892
1984 -0.2458586  0.1823842 -1.3480260 0.177686577
1988 -0.4002949  0.1916484 -2.0886944 0.036765446
1992 -0.4130676  0.1714062 -2.4098761 0.015979426
1996 -0.1150134  0.2071408 -0.5552427 0.578743539
2000 -0.4803803  0.3324209 -1.4450965 0.148468300
   educ1 
       Estimate Std. Error  t value     Pr(>|t|)
1972 0.29708140 0.05826098 5.099149 3.486782e-07
1976 0.27725165 0.06059524 4.575469 4.820110e-06
1980 0.09550340 0.08261012 1.156074 2.476841e-01
1984 0.07262794 0.06501416 1.117110 2.639796e-01
1988 0.14341231 0.06473108 2.215509 2.675200e-02
1992 0.28034637 0.05916944 4.738026 2.193737e-06
1996 0.25114897 0.07093933 3.540335 4.017956e-04
2000 0.24461379 0.10801197 2.264692 2.355714e-02
   female 
         Estimate Std. Error    t value  Pr(>|t|)
1972 -0.005967246 0.10018983 -0.0595594 0.9525080
1976  0.133917095 0.10627385  1.2601133 0.2076637
1980  0.028503792 0.13923027  0.2047241 0.8377927
1984 -0.013415054 0.10477430 -0.1280376 0.8981223
1988 -0.079195190 0.11057733 -0.7161974 0.4738895
1992 -0.068672465 0.09994152 -0.6871265 0.4920221
1996 -0.059622697 0.11476049 -0.5195403 0.6033978
2000 -0.094042287 0.17291800 -0.5438548 0.5865559
   income 
       Estimate Std. Error  t value     Pr(>|t|)
1972 0.16082722 0.05077800 3.167262 1.544359e-03
1976 0.17180218 0.05780102 2.972303 2.964176e-03
1980 0.22816242 0.07197937 3.169831 1.530785e-03
1984 0.22486650 0.05569547 4.037429 5.452323e-05
1988 0.06352031 0.05891789 1.078116 2.810132e-01
1992 0.13290845 0.05188253 2.561719 1.043296e-02
1996 0.20801839 0.05996115 3.469220 5.246072e-04
2000 0.23581142 0.08847826 2.665190 7.709274e-03

Residual standard error: 1.815982 on 8351 degrees of freedom

To create the plot, I need to do some data wrangling. Below I extract the coefficients from the summary, which returns an array. I then use the adply() function from the {plyr} package to convert the array to a data frame. Then I add year, upper 50% CI limit, and lower 50% CI limit to the data frame. I also change the variable column to a factor so the order of the coefficients will be preserved in the plot.

sout <- summary(fm1)$coefficients

library(plyr)
sumd2 <- adply(sout, .margins = 3, .id = "Var")
sumd2$year <- seq(1972,2000,4)
sumd2$upper <- sumd2$Estimate + sumd2$`Std. Error`*0.67
sumd2$lower <- sumd2$Estimate - sumd2$`Std. Error`*0.67
sumd2$Var <- factor(sumd2$Var, labels = coef_names)

And once again I create the plot.

ggplot(sumd2) +
  aes(x = year, y = Estimate) +
  geom_point() +
  geom_errorbar(mapping = aes(ymin = lower, ymax = upper), width = 0) +
  geom_hline(yintercept = 0, linetype = 2) +
  facet_wrap(~ Var, scales = "free") +
  theme_classic()

The result is almost identical to the one created using {rstanarm}.