Chapter 20 of Regression and Other Stories presents a toy example in section 20.8 of calculating an average treatment effect using weights based on the estimated inverse probability of treatment assignment. They leave out several steps in how they arrive at their values. In this post I fill in those steps. The data is presented in figure 20.15, which is actually a table. Here is part of the table.
n <- c(80, 50, 20, 50)
x <- c(0, 1, 0, 1)
t <- c(0, 0, 1, 1)
y <- c(110, 90, 120, 90)
d <- data.frame(n, x, t, y)
d
## n x t y
## 1 80 0 0 110
## 2 50 1 0 90
## 3 20 0 1 120
## 4 50 1 1 90
The first column, n, tells us the sample size for each condition. This columns sums to 200 total subjects. The second column, x, is a binary covariate. The third column, t, tells us the treatment (0 = control, 1 = treated). And the fourth column, y, tells us the outcome.
A naive estimate of the average treatment effect can be calculated as follows. Notice we need to use the sample size to accurately calculate the mean.
weighted.mean(y[3:4], n[3:4]) - weighted.mean(y[1:2], n[1:2])
## [1] -3.736264
This indicates the treatment causes a small decrease in the outcome. But this is wrong because the data is unbalanced in the covariate. In the control group (t = 0), there are 80 subjects with x = 0 and 50 with x = 1. However, the treatment group (t = 1) has 20 subjects with x = 0 and 50 subjects with x = 1. Using propensity scores we can reweight the data such that’s it’s balanced in the covariate.
The propensity score is the estimated probability of receiving the treatment. When x = 0, the estimated probability of receiving the treatment is 20/(20 + 80), or 0.2. When x = 1, the estimated probability of receiving the treatment is 50/(50 + 50), or 0.5.
d$p <- c(0.2, 0.5, 0.2, 0.5)
d
## n x t y p
## 1 80 0 0 110 0.2
## 2 50 1 0 90 0.5
## 3 20 0 1 120 0.2
## 4 50 1 1 90 0.5
We calculate weights by taking the inverse of the propensity score. Treated units receive weights of 1/p, while control units receive weights of 1/(1 – p). The weights for control units are the inverse of the probability they ended up in the control group.
d$w <- ifelse(d$t == 1, 1/d$p, 1/(1 - d$p))
d
## n x t y p w
## 1 80 0 0 110 0.2 1.25
## 2 50 1 0 90 0.5 2.00
## 3 20 0 1 120 0.2 5.00
## 4 50 1 1 90 0.5 2.00
Next we “norm” the weights so they average to 1. This is carried out by dividing by the average of the unnormed weights. The average of the unnormed weights needs to be calculated as a weighted mean. Then we can do the division.
avg <- weighted.mean(d$w, d$n)
d$nw <- d$w/avg
d
## n x t y p w nw
## 1 80 0 0 110 0.2 1.25 0.625
## 2 50 1 0 90 0.5 2.00 1.000
## 3 20 0 1 120 0.2 5.00 2.500
## 4 50 1 1 90 0.5 2.00 1.000
To get the reweighted data (or pseudo-population) that is balanced in the covariate, we multiply the group samples sizes by the normed weight.
# pseudo-population
d$pn <- d$nw * n
d
## n x t y p w nw pn
## 1 80 0 0 110 0.2 1.25 0.625 50
## 2 50 1 0 90 0.5 2.00 1.000 50
## 3 20 0 1 120 0.2 5.00 2.500 50
## 4 50 1 1 90 0.5 2.00 1.000 50
Notice we have balance across the groups. Each group now has a pseudo-population of 50. Now we can estimate the average treatment effect, which comes out to the true value of 5.
weighted.mean(d$y[3:4], d$pn[3:4]) - weighted.mean(d$y[1:2], d$pn[1:2])
## [1] 5
The reason we know 5 is the correct value is because the authors provide the hypothetical “potential outcomes” for both treatment groups in the table. y0 is the potential outcome when untreated and y1 is the potential outcome when treated.
data.frame(t = t,
y0 = c(110, 90, 110, 90),
y1 = c(120, 90, 120, 90))
## t y0 y1
## 1 0 110 120
## 2 0 90 90
## 3 1 110 120
## 4 1 90 90
With this information we can determine the true average treatment effect is 5.
mean(c(120,90)) - mean(c(110,90))
## [1] 5