The story involves the election of a board of directors for a “residential organization”. 5553 people were allowed to vote for up to 6 people. 27 candidates were running for the board. Votes were tallied after 600 people voted, then again at 1200, 2444, 3444, 4444, and the end after all 5553 people voted. What aroused suspicion was the fact that the proportion of votes for the candidates remained steady each time the votes were tallied. According to the author of the fax: “the election was rigged…[it] is a fixed vote with fixed percentages being assigned to each and every candidate making it impossible to participate in an honest election.”

Let’s read in the data and demonstrate what they’re talking about. Notice this data is the rare CSV without column headers. The data consists of 27 rows, one for each candidate, showing cumulative vote totals.

```
data <- read.csv("https://raw.githubusercontent.com/avehtari/ROS-Examples/master/Coop/data/Riverbay.csv",
header = FALSE)
# drop 1st and 8th columns; contain candidate names which we don't need.
votes <- data[,2:7]
head(votes)
```

```
## V2 V3 V4 V5 V6 V7
## 1 208 416 867 1259 1610 2020
## 2 55 106 215 313 401 505
## 3 133 250 505 716 902 1129
## 4 101 202 406 589 787 976
## 5 108 249 512 745 970 1192
## 6 54 94 196 279 360 451
```

Now let’s calculate the proportion of votes received at each interval and create a basic line plot. Each line below represents proportion of votes received for a candidate at each of the six intervals. Notice how the lines are mostly flat. This is what prompted the emergency fax.

```
vote_p <- apply(votes, 2, proportions)
matplot(t(vote_p), type = "l", col = 1, lty = 1)
```

Gelman, et al. demonstrate this using separate plots for the top 8 vote-getters (Fig 4.5). They also divide by number of voters instead of total votes received. (Remember, each voter gets to vote for up to six people.) This simply changes the denominator, and hence, the y-axis. The steady vote patterns remain.

```
voters <- c(600,1200,2444,3444,4444,5553)
vote_p <- sweep(votes, 2, voters, FUN = "/")
matplot(t(vote_p), type = "l", col = 1, lty = 1)
```

They note that the data in this plot is not independent since proportions at times 2 and beyond include votes that came before. To address this, they create a matrix that contains number of votes received at *each interval* instead of cumulative totals.

```
interval_votes <- t(apply(votes, 1, diff))
interval_votes <- cbind(votes[,1], interval_votes)
head(interval_votes)
```

```
## V3 V4 V5 V6 V7
## [1,] 208 208 451 392 351 410
## [2,] 55 51 109 98 88 104
## [3,] 133 117 255 211 186 227
## [4,] 101 101 204 183 198 189
## [5,] 108 141 263 233 225 222
## [6,] 54 40 102 83 81 91
```

After taking differences the lines still seem mostly stable.

```
interval_p <- apply(interval_votes, 2, proportions)
matplot(t(interval_p), type = "l", col = 1, lty = 1)
```

Again, the authors divide by number of voters instead of total votes to create these plots, but the result is the same with a different y-axis. Here’s how I would do the calculations and create the plot.

```
interval_voters <- c(600, diff(voters))
interval_p <- sweep(interval_votes, 2, interval_voters, FUN = "/")
matplot(t(interval_p), type = "l", col = 1, lty = 1)
```

And now comes the hypothesis test. What is the probability of seeing steady proportions like this if the votes really were coming in at random? I’ll quote the book here: “Because the concern was that the votes were unexpectedly stable as the count proceeded, we define a test statistic to summarize variability.” The test statistic in this case is, for each candidate, the standard deviation of the sample proportions across the six intervals. We can quickly get these from the interval_p object we created above.

`test_stat <- apply(interval_p, 1, sd)`

Now we need to calculate the theoretical test statistic. For this we assume each candidate has a fixed but unknown proportion of voters who will vote for them, \(\pi_i\). Under the null, the six intervals where votes are tallied are random samples of the voters. So at each time point we can think of the proportion as a draw from a distribution with mean \(\pi_i\) and standard deviation \(\sqrt{\pi_i(1 - \pi_i)/n_t}\), where \(n_t\) is the number of voters at each interval. To calculate this, we first need to estimate \(\pi_i\) with \(p_i\), the observed proportion of votes each candidate received. This is the last column of the votes data frame divided by the total number of voters, 5553.

`p_hat <- votes[,6]/5553`

Then we take the average of the variances calculated at each time point and take the square root to get the theoretical test statistic.

`theory_test_stat <- sapply(p_hat, function(x)sqrt(mean(x*(1-x)/interval_voters)))`
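To see why averaging the variances is reasonable, here is a small simulation sketch. It is my own check, not code from the book: the support level `p` is a hypothetical value, and the interval sizes are the `interval_voters` values from above, retyped so the block runs on its own. Under the null, the average squared standard deviation of the six interval proportions should land close to the average of the six binomial variances.

```r
set.seed(1)
# Voters per interval, i.e. c(600, diff(voters)) from above
interval_voters <- c(600, 600, 1244, 1000, 1000, 1109)
p <- 0.2  # hypothetical true support for one candidate

# Simulate the six interval proportions many times and record their variance
sim_var <- replicate(5000, {
  props <- rbinom(length(interval_voters), size = interval_voters, prob = p) /
    interval_voters
  var(props)
})

# Theoretical value: mean of the binomial variances across intervals
theory_var <- mean(p * (1 - p) / interval_voters)

# Compare on the standard deviation scale
c(simulated = sqrt(mean(sim_var)), theoretical = sqrt(theory_var))
```

If the two numbers are close, the theoretical test statistic is a sensible benchmark for the observed standard deviations.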

Under the null, the observed test statistics should be very close to the theoretical test statistics. This is assessed in Fig 4.7 in the book. I replicate the plot as follows:

```
plot(x = votes[,6], y = test_stat, xlab = "total # of votes for the candidate",
ylab = "sd of separate vote proportions")
points(x = votes[,6], y = theory_test_stat, pch = 19)
```

The authors note that “the actual standard deviations appear consistent with the theoretical model.”

Personally I think the plot would be a little more effective if they zoomed out a little. Some of the dramatic looking departures are only off by 0.01. For example:

```
plot(x = votes[,6], y = test_stat, xlab = "total # of votes for the candidate",
ylab = "sd of separate vote proportions", ylim = c(0,0.05))
points(x = votes[,6], y = theory_test_stat, pch = 19)
```

Another null hypothesis approach is the chi-square test of association. Under the null, the number of votes is not associated with the interval when votes were tallied. We can run this test for each candidate and look at the p-values. If there is no association for any candidate, we should see a fairly uniform scatter of p-values. On the other hand, if there was “suspiciously little variation over time” we would see a surplus of high p-values. Here’s how I carried out these calculations. I first created the 2-way tables of yes/no versus time for each candidate. I then applied the chi-square test to each table and extracted the p-value from each result. A uniform QQ plot shows the p-values are mostly uniformly distributed.

```
tables <- apply(interval_votes, 1, function(x) rbind(x, interval_voters - x),
simplify = FALSE)
chisq_out <- lapply(tables, chisq.test, correct = FALSE)
p_values <- sapply(chisq_out, function(x)x$p.value)
qqplot(ppoints(27), p_values)
qqline(p_values, distribution = qunif)
```
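To make the table construction concrete, here is a sketch of the 2-way table for the first candidate alone. The counts are copied from the interval output printed earlier and retyped so the block runs on its own.

```r
# Interval votes for the first candidate, copied from the output above
yes <- c(208, 208, 451, 392, 351, 410)
# Number of voters in each interval: c(600, diff(voters))
interval_voters <- c(600, 600, 1244, 1000, 1000, 1109)

# Row 1: voters who chose the candidate; row 2: voters who did not
tab <- rbind(yes = yes, no = interval_voters - yes)
tab

# Test of association between choosing this candidate and the interval
res <- chisq.test(tab, correct = FALSE)
res$p.value
```

Each column of the table sums to the number of voters in that interval, so the test has 5 degrees of freedom.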

Finally the authors mention that a single test on the entire 27 x 6 table could be performed. This seems like the easiest approach of all.

`chisq.test(interval_votes, correct = F)`

```
##
## Pearson's Chi-squared test
##
## data: interval_votes
## X-squared = 114.72, df = 130, p-value = 0.8279
```

My R code differs quite a bit from the R code provided by the authors. I’m not saying mine is better, it just makes more sense to me. Maybe someone else will find this approach useful.

```
x <- rnorm(200, mean = 8, sd = 8)
c(xbar = mean(x), s = sd(x))
```

```
## xbar s
## 8.333385 7.979586
```

If we wanted to assess the goodness-of-fit of this sample to a normal distribution, the following is a bad way to use the KS test:

`ks.test(x, "pnorm", mean(x), sd(x))`

```
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: x
## D = 0.040561, p-value = 0.8972
## alternative hypothesis: two-sided
```

The appropriate way to use the KS test is to actually supply hypothesized parameters. For example:

`ks.test(x, "pnorm", 8, 8)`

```
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: x
## D = 0.034639, p-value = 0.9701
## alternative hypothesis: two-sided
```

Both tests lead to the same conclusion: we fail to reject the null hypothesis that the sample is from a Normal distribution with the stated mean and standard deviation. However, the former test is very conservative. Zeimbekakis, et al. show this via simulation. I show a simplified version of this simulation. The basic idea is that if the test were valid, the p-values would be uniformly distributed and the points in the uniform distribution QQ-plot would fall along a diagonal line. Clearly that’s not the case when the parameters are estimated from the same sample.

```
n <- 200
rout <- replicate(n = 1000, expr = {
x <- rnorm(n, 8 , 8)
xbar <- mean(x)
s <- sd(x)
ks.test(x, "pnorm", xbar, s)$p.value
})
hist(rout, main = "Histogram of p-values")
```

```
qqplot(x = ppoints(n), y = rout, main = "Uniform QQ-plot")
qqline(rout, distribution = qunif)
```

Conclusion: using fitted parameters in place of the true parameters in the KS test yields conservative results. The authors state in the abstract that this “has been ‘discovered’ multiple times.”

When done the right way, the KS test yields uniformly distributed p-values.

```
rout2 <- replicate(n = 1000, expr = {
x <- rnorm(n, 8 , 8)
ks.test(x, "pnorm", 8, 8)$p.value
})
hist(rout2)
```

```
qqplot(x = ppoints(n), y = rout2, main = "Uniform QQ-plot")
qqline(rout2, distribution = qunif)
```

Obviously it’s difficult to know which parameters to supply to the KS test. Above we knew to supply 8 as the mean and standard deviation because that’s what we used to generate the data. But what to do in real life? Zeimbekakis, et al. propose a parametric bootstrap to approximate the null distribution of the KS test statistic. The steps to implement the bootstrap are as follows:

- draw a random sample from the fitted distribution
- get estimates of parameters of random sample
- obtain the empirical distribution function
- calculate the bootstrapped KS statistic
- repeat steps 1 – 4 many times

Let’s do it. The following code is a simplified version of what the authors provide with the paper. Notice they use `MASS::fitdistr()` to obtain MLE parameter estimates. This returns the same mean for the normal distribution but a slightly smaller (i.e. biased) estimated standard deviation.
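As a quick illustration of that difference, here is a sketch comparing `sd()` with the MLE on a fresh sample. The names `x_demo` and `n_demo` are my own, chosen so the block does not overwrite the objects used elsewhere in this post.

```r
set.seed(1)
x_demo <- rnorm(200, mean = 8, sd = 8)
n_demo <- length(x_demo)

# MLE estimates of the mean and standard deviation
mle <- MASS::fitdistr(x_demo, "normal")$estimate

# sd() divides by n - 1; the MLE divides by n,
# so rescaling sd() by sqrt((n - 1)/n) should reproduce the MLE
c(sd = sd(x_demo),
  mle_sd = unname(mle["sd"]),
  rescaled_sd = sd(x_demo) * sqrt((n_demo - 1) / n_demo))
```

With 200 observations the two estimates differ only slightly, which is why the choice has little practical effect here.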

```
param <- MASS::fitdistr(x, "normal")$estimate
ks <- ks.test(x, function(x)pnorm(x, param[1], param[2]))
stat <- ks$statistic
B <- 1000
stat.b <- double(B)
n <- length(x)
## bootstrapping
for (i in 1:B) {
# (1) draw a random sample from a fitted dist
x.b <- rnorm(n, param[1], param[2])
# (2) get estimates of parameters of random sample
fitted.b <- MASS::fitdistr(x.b, "normal")$estimate
# (3) get empirical distribution function
Fn <- function(x)pnorm(x, fitted.b[1], fitted.b[2])
# (4) calculate bootstrap KS statistic
stat.b[i] <- ks.test(x.b, Fn)$statistic
}
mean(stat.b >= stat)
```

`## [1] 0.61`

The p-value is the proportion of statistics greater than or equal to the observed statistic calculated with estimated parameters.

Let’s turn this into a function and show that it returns uniformly distributed p-values when used with multiple samples. Again this is a simplified version of the R code the authors generously shared with their paper.

```
ks.boot <- function(x, B = 1000){
param <- MASS::fitdistr(x, "normal")$estimate
ks <- ks.test(x, function(k)pnorm(k, param[1], param[2]))
stat <- ks$statistic
stat.b <- double(B)
n <- length(x)
for (i in 1:B) {
x.b <- rnorm(n, param[1], param[2])
fitted.b <- MASS::fitdistr(x.b, "normal")$estimate
Fn <- function(x)pnorm(x, fitted.b[1], fitted.b[2])
stat.b[i] <- ks.test(x.b, Fn)$statistic
}
mean(stat.b >= stat)
}
```

Now replicate the function with many samples. This takes a moment to run. It took my Windows 11 PC with an Intel i7 chip about 100 seconds to run.

```
rout_boot <- replicate(n = 1000, expr = {
x <- rnorm(n, 8 , 8)
ks.boot(x)
})
```

`hist(rout_boot)`

```
qqplot(x = ppoints(n), y = rout_boot, main = "Uniform QQ-plot")
qqline(rout_boot, distribution = qunif)
```

The R script that performs the age adjustment is births.R. It clocks in at over 400 lines and has practically no comments outside of the occasional “Sum it up.” As you run the code, you’ll find the script generates several plots not in the book. In addition, the plots that *are* in the book are generated in a different order. Trying to parse the R code to help me understand the exposition was frustrating. But I persisted.

Reading the bibliographic note at the end of the chapter indicated the age adjustment example was first discussed on Gelman’s blog. In the blog post he walks through the process of age adjustment, creating the same plots in the book, and provides the R code. This is basically the births.R script. He says at the end, “the code is ugly. Don’t model your code after my practices! If any of you want to make a statistics lesson out of this episode, I recommend you clean the code.”

This blog post is my statistics lesson trying to understand and clean this code.

The data apparently come from the CDC, but I’m using the data file Gelman provides with his R code. The data show the number of deaths by age, sex, and year for non-Hispanic whites in the US. For example, the first row shows 1291 female deaths (Male = 0) in 1999 among those who were 35 years old. The total population of 35-year-old women in 1999 was 1,578,829. The rate is 1291/1,578,829 x 100,000 = 81.8, or about 82 deaths per 100,000.

```
data <- read.table("white_nonhisp_death_rates_from_1999_to_2013_by_sex.txt",
header=TRUE)
head(data)
```

```
## Age Male Year Deaths Population Rate
## 1 35 0 1999 1291 1578829 81.8
## 2 35 0 2000 1264 1528463 82.7
## 3 35 0 2001 1186 1377466 86.1
## 4 35 0 2002 1194 1333639 89.5
## 5 35 0 2003 1166 1302188 89.5
## 6 35 0 2004 1166 1325435 88.0
```

The first plot is mortality rate of the 45-54 age group from 1999 – 2013. We first sum both Deaths and Population by year and then calculate the Mortality Rate by dividing Deaths by Population. This is a nice opportunity to use the base R pipe operator, `|>`.

```
aggregate(cbind(Deaths, Population) ~ Year, data = data, FUN = sum,
subset = Age %in% 45:54) |>
transform(Rate = Deaths/Population) |>
plot(Rate ~ Year, data = _, type = "l", ylab = "Death Rate")
```

This is the third plot in Gelman’s blog post titled “So take the ratio!”

The second plot shows the average age of the 45-54 age group increasing as the baby boomers move through. First we sum the Population by Age and Year for the 45-54 group. Then we take that data and compute the mean age per year, weighted by the population.

```
aggregate(Population ~ Age + Year, data = data, sum,
subset = Age %in% 45:54) |>
aggregate(Population ~ Year, data = _,
function(x)weighted.mean(45:54, x)) |>
plot(Population ~ Year, data = _, type = "l")
```

To help make this clear, let’s find the mean age of the 45-54 group in 1999. First find the population for each age in 1999:

```
tmp <- aggregate(Population ~ Age + Year, data = data, sum,
subset = Age %in% 45:54 & Year == 1999)
tmp$Population
```

```
## [1] 3166393 3007083 2986252 2805975 2859406 2868751 2804957 3093631 2148382
## [10] 2254975
```

To find the mean age of the 45-54 group in 1999, we need to weight each age with the population. We can do that with the `weighted.mean()` function.

`weighted.mean(45:54, tmp$Population)`

`## [1] 49.25585`
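For anyone who wants to see what `weighted.mean()` is doing, here is the same calculation by hand. The population counts are retyped from the output above so the block runs on its own.

```r
ages <- 45:54
# 1999 population by age for the 45-54 group, copied from the output above
pop_1999 <- c(3166393, 3007083, 2986252, 2805975, 2859406,
              2868751, 2804957, 3093631, 2148382, 2254975)

# Weighted mean = sum of (value x weight) divided by the sum of the weights
by_hand <- sum(ages * pop_1999) / sum(pop_1999)
by_hand

# Should agree with weighted.mean()
all.equal(by_hand, weighted.mean(ages, pop_1999))
```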

The code above does this for 1999-2013. I think it’s worth noting that while the plot looks dramatic, the average age only increases from about 49.2 to 49.7. But I suppose when you’re dealing with millions of people that increase makes a difference.

This is the fourth plot in Gelman’s blog post titled “But the average age in this group is going up!”

This is where I began to struggle when reading the book.

This figure is titled “The trend in raw death rates since 2005 can be explained by age-aggregation bias”. This is the eighth plot in the blog post where it has a bit more motivation. Let’s recreate the plots in the blog post leading up to this plot.

The first plot to recreate is the sixth plot in the blog post. It’s basically the previous plot rescaled as a rate. It’s created by first calculating the death rate in 1999 for each age, and then taking the weighted mean of those rates for each year, using that year’s population at each age as the weights.

```
dr1999 <- aggregate(cbind(Deaths, Population) ~ Age, data = data, FUN = sum,
subset = Age %in% 45:54 & Year == 1999) |>
transform(Rate = Deaths/Population)
# Now create plot
aggregate(Population ~ Age + Year, data = data, sum,
subset = Age %in% 45:54) |>
aggregate(Population ~ Year, data = _,
function(x)weighted.mean(dr1999$Rate, x)) |>
plot(Population ~ Year, data = _, type = "l", ylab = "Reconstructed death rate")
```

Next he combines this plot with the plot of the raw death rate (Fig 2.11 (a)). This is the seventh plot in the blog post.

```
years <- 1999:2013
Raw <- aggregate(cbind(Deaths, Population) ~ Year, data = data, FUN = sum,
subset = Age %in% 45:54) |>
transform(Rate = Deaths/Population)
Expected <- aggregate(Population ~ Age + Year, data = data, sum,
subset = Age %in% 45:54) |>
aggregate(Population ~ Year, data = _,
function(x)weighted.mean(dr1999$Rate, x))
plot(years, Raw$Rate, type="l", ylab="Death rate for 45-54 non-Hisp whites")
lines(years, Expected$Population, col="green4")
text(2002.5, .00404, "Raw death rate", cex=.8)
text(2009, .00394, "Expected just from\nage shift", col="green4", cex=.8)
```

Then finally he says, “We can sharpen this comparison by anchoring the expected-trend-in-death-rate-just-from-changing-age-composition graph at 2013, the end of the time series, instead of 1999.” This means we need to calculate the death rate in 2013, and then take the weighted mean of that rate by using the total population for each age group. This is the dr2013 data frame. Then we create the same plot as above except now using the death rate in 2013.

```
dr2013 <- aggregate(cbind(Deaths, Population) ~ Age, data = data, FUN = sum,
subset = Age %in% 45:54 & Year == 2013) |>
transform(Rate = Deaths/Population)
Raw <- aggregate(cbind(Deaths, Population) ~ Year, data = data, FUN = sum,
subset = Age %in% 45:54) |>
transform(Rate = Deaths/Population)
Expected <- aggregate(Population ~ Age + Year, data = data, sum,
subset = Age %in% 45:54) |>
aggregate(Population ~ Year, data = _,
function(x)weighted.mean(dr2013$Rate, x))
plot(years, Raw$Rate, type="l",
ylab="Death rate for 45-54 non-Hisp whites")
lines(years, Expected$Population, col="green4")
text(2002.5, 0.00395, "Raw death rate", cex=.8)
text(2002, .00409, "Expected just from\nage shift", col="green4", cex=.8)
```

Gelman notes, “since 2003, all the changes in raw death rate in this group can be explained by changes in age composition.”

This is the first plot showing age-adjusted death rates. Gelman explains this as follows in his blog post: “for each year in time, we take the death rates by year of age and average them, thus computing the death rate that would’ve been observed had the population distribution of 45-54-year-olds been completely flat each year.” The book calls it “the simplest such adjustment, normalizing each year to a hypothetical uniformly distributed population in which the number of people is equal at each age from 45 through 54.” I found this latter explanation a little confusing.

To create this plot we first sum Deaths and Populations by age and year for the 45-54 age group, then calculate the death rate, and then simply take the mean rate by year. That’s it. Gelman takes the additional step of rescaling the rate so that the rate is 1 in 1999.

```
aggregate(cbind(Deaths, Population) ~ Age + Year, data = data, sum,
subset = Age %in% 45:54) |>
transform(Rate = Deaths/Population) |>
aggregate(Rate ~ Year, data = _, mean) |>
transform(AA_Rate = Rate/Rate[1]) |> # relative to 1999
plot(AA_Rate ~ Year, data = _, type = "l",
ylab = "age-adjusted death rate, relative to 1999")
```
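A toy example helped me see why the simple mean of rates removes the age-composition effect. The data below are entirely hypothetical: two ages, two years, and age-specific death rates that never change. Only the mix of ages shifts.

```r
# Hypothetical toy data: rates are fixed, only the age mix changes
toy <- data.frame(
  Year = c(1, 1, 2, 2),
  Age = c(45, 54, 45, 54),
  Population = c(1000, 1000, 500, 1500),
  Rate = c(0.002, 0.006, 0.002, 0.006)
)
toy$Deaths <- toy$Population * toy$Rate

# Raw rate: total deaths over total population, per year
raw <- aggregate(cbind(Deaths, Population) ~ Year, data = toy, FUN = sum) |>
  transform(Rate = Deaths / Population)

# Age-adjusted rate: simple mean of the age-specific rates, per year
adjusted <- aggregate(Rate ~ Year, data = toy, FUN = mean)

raw$Rate       # rises because the population got older
adjusted$Rate  # stays flat because no age-specific rate changed
```

The raw rate goes up purely because more people are in the older, higher-risk age. The simple mean treats each age equally, so it only moves when the age-specific rates themselves move.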

In the book, this plot shows two different age adjustments, even though the exposition says there are three. They’re probably referring to the original blog post plot, which does show three. I recreate the plot in the blog post, which is the second to last plot.

This plot shows (1) age-adjustment using the simple mean of rates, (i.e., the plot above), (2) age-adjustment using the distribution of ages in 1999, and (3) age-adjustment using the distribution of ages in 2013.

This plot requires the most work of all. First we need to get the total population for all ages in 1999 and 2013. These are used to make the age adjustments.

```
pop1999 <- aggregate(Population ~ Age, data = data,
subset = Year == 1999 & Age %in% 45:54, sum)[["Population"]]
pop2013 <- aggregate(Population ~ Age, data = data,
subset = Year == 2013 & Age %in% 45:54, sum)[["Population"]]
```

Next we calculate age-adjusted rates using the population distributions from 1999 and 2013. Again we sum Deaths and Populations by age and year for the 45-54 age group and calculate the death rate. Then we calculate the average rate by year using the population distributions to calculate a weighted mean.

```
# age-adjustment from Fig 2.12 (a)
aa_rate_uniform <- aggregate(cbind(Deaths, Population) ~ Age + Year,
data = data, sum,
subset = Age %in% 45:54) |>
transform(Rate = Deaths/Population) |>
aggregate(Rate ~ Year, data = _, mean) |>
transform(AA_Rate = Rate/Rate[1])
aa_rate_1999 <- aggregate(cbind(Deaths, Population) ~ Age + Year,
data = data, sum,
subset = Age %in% 45:54) |>
transform(Rate = Deaths/Population) |>
aggregate(Rate ~ Year, data = _, function(x)weighted.mean(x, pop1999))
aa_rate_2013 <- aggregate(cbind(Deaths, Population) ~ Age + Year,
data = data, sum,
subset = Age %in% 45:54) |>
transform(Rate = Deaths/Population) |>
aggregate(Rate ~ Year, data = _, function(x)weighted.mean(x, pop2013))
```

Now we can make the plot. Notice we find the range of all the data to help set the limits of the y-axis. Also notice we rescale the plot so all lines begin at 1.

```
rng <- range(aa_rate_uniform$Rate/aa_rate_uniform$Rate[1],
aa_rate_1999$Rate/aa_rate_1999$Rate[1],
aa_rate_2013$Rate/aa_rate_2013$Rate[1])
plot(years, aa_rate_uniform$Rate/aa_rate_uniform$Rate[1], type = "l", ylim=rng,
ylab = "age-adjusted death rate, relative to 1999")
lines(years, aa_rate_1999$Rate/aa_rate_1999$Rate[1], lty=2)
lines(years, aa_rate_2013$Rate/aa_rate_2013$Rate[1], lty=3)
text(2003, 1.053, "Using 1999\nage dist", cex=.8)
text(2004, 1.032, "Using 2013\nage dist", cex=.8)
```

The point of this plot is to demonstrate it doesn’t matter how the age-adjustment is done.

The final plot shows age-adjusted death rates broken down by sex. This is basically the same code as Fig 2.12 (a) but with Male included in the calls to `aggregate()`. To rescale the y-axis so it starts at 1 we need to divide each vector of rates by the respective 1999 value.

```
aa_rate_sex <- aggregate(cbind(Deaths, Population) ~ Age + Year + Male,
data = data, sum,
subset = Age %in% 45:54) |>
transform(Rate = Deaths/Population) |>
aggregate(Rate ~ Year + Male, data = _, mean)
plot(years, aa_rate_sex$Rate[aa_rate_sex$Male == 0]/
aa_rate_sex$Rate[aa_rate_sex$Year == 1999 & aa_rate_sex$Male == 0],
col="red", type = "l",
ylab = "Death rate relative to 1999")
lines(years, aa_rate_sex$Rate[aa_rate_sex$Male == 1]/
aa_rate_sex$Rate[aa_rate_sex$Year == 1999 & aa_rate_sex$Male == 1],
col="blue", type = "l")
text(2011.5, 1.075, "Women", col="red")
text(2010.5, 1.02, "Men", col="blue")
```

Gelman called his code “ugly”, but it’s his code and he understands it. I don’t claim my code is any better, but it’s my code and I understand it.

As we would typically estimate the success probability *p* with the observed success probability \(\hat{p} = \sum_iX_i/n\), we might consider using \(\frac{\hat{p}}{1 - \hat{p}}\) as an estimate of \(\frac{p}{1 - p}\) (the odds). But what are the properties of this estimator? How might we estimate the variance of \(\frac{\hat{p}}{1 - \hat{p}}\)? Moreover, how can we approximate its sampling distribution? Intuition abandons us, and exact calculation is relatively hopeless, so we have to rely on an approximation. The Delta Method will allow us to obtain reasonable, approximate answers to our questions. (Casella and Berger, p. 240)

Most statistics books that teach the Delta Method work a few examples where they manually derive the standard error of a nonlinear function of some statistic. This requires some calculus and algebra. The result is a closed-form formula we could ostensibly use in a function to estimate the standard error of a statistic, such as estimated odds, which is a function of an estimated proportion. I want to document how we can use the `deltaMethod()` function in the {car} package to do this work for us.

Casella and Berger show that the estimated variance of the odds estimator is \(\frac{\hat{p}}{n(1 - \hat{p})^3}\) (p. 242), so the estimated standard error is its square root. If we didn’t know this offhand or have a function available to us, we can use the `deltaMethod()` function to derive this estimator on-the-fly as we analyze data. For example, let’s say we observe 19 successes out of 30 trials, an estimated probability of about 0.63, but we want to express that as odds and obtain a confidence interval on the estimated odds.

To begin we load the {car} package. Next we need to store our probability estimate in a *named* vector. I gave it the name “p”. After that, we need to estimate the variance of the probability estimate, which in this case is the familiar \(\hat{p}(1 - \hat{p})/n\). Finally we use the `deltaMethod()` function. The first argument is our named vector containing the estimated probability. The second argument is the function of our estimate expressed as a *character string*. Notice this is the odds. The third argument is the estimated variance of our original estimate.

```
library(car)
p_hat <- c("p" = 19/30)
var_p <- p_hat*(1 - p_hat)/30
deltaMethod(p_hat, g. = "p/(1-p)", vcov. = var_p)
```

```
## Estimate SE 2.5 % 97.5 %
## p/(1 - p) 1.72727 0.65441 0.44466 3.0099
```

So our estimated odds is about 1.73 with a 95% confidence interval of [0.44, 3.01]. The reported standard error agrees with the square root of the variance formula provided in Casella and Berger.

`sqrt(p_hat/(30*(1 - p_hat)^3))`

```
## p
## 0.6544077
```
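For the curious, here is a sketch of the first-order Delta Method done by hand with base R. This is my own illustration and not necessarily how `deltaMethod()` is implemented internally. It uses `D()` to differentiate the odds with respect to p and then scales the standard error of the proportion by that slope. The name `n_trials` is mine.

```r
p_hat <- c("p" = 19/30)
n_trials <- 30
var_p <- p_hat * (1 - p_hat) / n_trials

# Symbolic derivative of the odds with respect to p
g_prime <- D(expression(p / (1 - p)), "p")

# Evaluate the derivative at the estimate
slope <- eval(g_prime, list(p = unname(p_hat)))

# First-order delta method: SE(g(p_hat)) is roughly |g'(p_hat)| * SE(p_hat)
se_odds <- abs(slope) * sqrt(unname(var_p))
se_odds
```

This should reproduce the 0.6544 standard error reported above.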

In *Foundations of Statistics for Data Scientists*, Agresti and Kateri use the Delta Method to derive the variance of square root transformed Poisson counts. They show that the square root of a Poisson random variable with a “large mean” has an approximate standard error of 1/2. Again we can use the `deltaMethod()` function with data to derive this on-the-fly.

Below we simulate 10,000 observations from a Poisson distribution with mean 25. Then we estimate the mean and assign it to a named vector. Finally we use the `deltaMethod()` function to show the result is indeed about 1/2. Notice we simply have to provide the transformation as a character string in the second argument.

```
set.seed(123)
y <- rpois(10000, 25)
m <- c("m" = mean(y))
deltaMethod(m, g. = "sqrt(m)", vcov. = var(y))
```

```
## Estimate SE 2.5 % 97.5 %
## sqrt(m) 4.99967 0.49747 4.02465 5.9747
```
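As a sanity check on the 1/2 result, we can skip the Delta Method entirely and take the standard deviation of the square-rooted counts directly. This is a sketch using the same simulated data as above. The names `sd_direct` and `sd_delta` are mine.

```r
set.seed(123)
y <- rpois(10000, 25)

# Direct estimate: standard deviation of the transformed counts
sd_direct <- sd(sqrt(y))

# Delta method by hand: sd(y) times the derivative of sqrt() at the mean
sd_delta <- sd(y) / (2 * sqrt(mean(y)))

c(direct = sd_direct, delta = sd_delta)
```

Both values should be close to 1/2, with the direct estimate differing slightly because the Delta Method is only a first-order approximation.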

Of course the `deltaMethod()` function was really designed to take fitted model objects and estimate the standard error of functions of coefficients. See its help page for a few examples. But I wanted to show it could also be used for more pedestrian textbook examples.

**References**

- Agresti, A. and Kateri, M. (2022). *Foundations of Statistics for Data Scientists*. CRC Press.
- Casella, G. and Berger, R.L. (2002). *Statistical Inference*, 2nd edition. Duxbury Press, Pacific Grove.
- Fox, J. and Weisberg, S. (2019). *An R Companion to Applied Regression*, 3rd edition. Sage, Thousand Oaks CA. https://socialsciences.mcmaster.ca/jfox/Books/Companion/
- R Core Team (2024). *R: A Language and Environment for Statistical Computing*. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/

The following data come from the text *Design and Analysis of Experiments* by Dean and Voss (1999). It involves the lifetime per unit cost of nonrechargeable batteries. Four types of batteries are considered:

- alkaline, name brand
- alkaline, store brand
- heavy duty, name brand
- heavy duty, store brand

We can read the data in from the textbook web site.

```
URL <- "https://corescholar.libraries.wright.edu/cgi/viewcontent.cgi?filename=12&article=1007&context=design_analysis&type=additional"
bat <- read.table(URL, header=TRUE)
names(bat) <- tolower(names(bat))
bat$type <- factor(bat$type)
head(bat)
```

```
## type lpuc order
## 1 1 611 1
## 2 2 923 2
## 3 1 537 3
## 4 4 476 4
## 5 1 542 5
## 6 1 593 6
```

A quick look at the means suggests the alkaline store brand battery (type 2) is the best battery for the money.

```
means <- tapply(bat$lpuc, bat$type, mean)
means
```

```
## 1 2 3 4
## 570.75 860.50 433.00 496.25
```

We might want to make comparisons between these means. Three such comparisons are as follows:

- compare battery duty (alkaline vs heavy duty)
- compare battery brand (name brand versus store brand)
- compare the interaction (levels 1/4 vs levels 2/3)

These are the comparisons presented in the text (p. 171). We can make these comparisons “by hand”.

```
# compare battery duty
mean(means[1:2]) - mean(means[3:4])
```

`## [1] 251`

```
# compare battery brand
mean(means[c(1,3)]) - mean(means[c(2,4)])
```

`## [1] -176.5`

```
# compare interaction
mean(means[c(1,4)]) - mean(means[c(2,3)])
```

`## [1] -113.25`

We can also make these comparisons using a *contrast* matrix. Below we create the contrast as a matrix object and name it K. Then we use matrix multiplication to calculate the same comparisons above.

```
K <- matrix(c(1/2, 1/2, -1/2, -1/2,
1/2, -1/2, 1/2, -1/2,
1/2, -1/2, -1/2, 1/2),
byrow = T, ncol = 4)
K %*% means
```

```
## [,1]
## [1,] 251.00
## [2,] -176.50
## [3,] -113.25
```

This particular contrast matrix is *orthogonal*. “Two contrasts are orthogonal if the sum of the products of corresponding coefficients (i.e. coefficients for the same means) adds to zero.” (source) We can show this using the `crossprod()` function.

`crossprod(K[1,],K[2,])`

```
## [,1]
## [1,] 0
```

`crossprod(K[2,],K[3,])`

```
## [,1]
## [1,] 0
```

`crossprod(K[1,],K[3,])`

```
## [,1]
## [1,] 0
```
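We can also check all pairs at once. The sketch below uses `tcrossprod()`, which computes `K %*% t(K)`: the off-diagonal entries are the cross products between different contrasts. K is retyped here so the block runs on its own, and the name `G` is mine.

```r
K <- matrix(c(1/2,  1/2, -1/2, -1/2,
              1/2, -1/2,  1/2, -1/2,
              1/2, -1/2, -1/2,  1/2),
            byrow = TRUE, ncol = 4)

# Entry [i, j] is the sum of products of contrast i and contrast j
G <- tcrossprod(K)
G

# Orthogonal contrasts have zeros everywhere off the diagonal
all(G[upper.tri(G)] == 0)
```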

Obviously we would like to calculate standard errors and confidence intervals for these comparisons. One way is to fit a model using `lm()` and do the comparisons as a follow-up using a package such as {multcomp}. To do this we use the `glht()` function with our contrast, K.

```
library(multcomp)
m <- lm(lpuc ~ type, data = bat)
comp <- glht(m, linfct = mcp(type = K))
confint(comp)
```

```
##
## Simultaneous Confidence Intervals
##
## Multiple Comparisons of Means: User-defined Contrasts
##
##
## Fit: lm(formula = lpuc ~ type, data = bat)
##
## Quantile = 2.7484
## 95% family-wise confidence level
##
##
## Linear Hypotheses:
## Estimate lwr upr
## 1 == 0 251.0000 184.1334 317.8666
## 2 == 0 -176.5000 -243.3666 -109.6334
## 3 == 0 -113.2500 -180.1166 -46.3834
```

Another way is to *encode* the contrasts directly in our model. We can do that using the capital `C()` function. The only difference here is that we need to transpose the matrix. The comparisons need to be defined on the columns instead of the rows. We could use the `t()` function to transpose K, but we’ll go ahead and create a new matrix to make this clear.

```
K2 <- matrix(c(1/2, 1/2, -1/2, -1/2,
1/2, -1/2, 1/2, -1/2,
1/2, -1/2, -1/2, 1/2),
ncol = 3)
```

Now we refit the model using the contrast in the model formula. Notice the coefficients are the same comparisons we calculated above. The intercept is the grand mean of lpuc (i.e. `mean(bat$lpuc)`).

```
m2 <- lm(lpuc ~ C(type, K2), data = bat)
summary(m2)
```

```
##
## Call:
## lm(formula = lpuc ~ C(type, K2), data = bat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.50 -33.56 -18.12 38.19 72.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 590.12 12.16 48.511 3.85e-15 ***
## C(type, K2)1 251.00 24.33 10.317 2.55e-07 ***
## C(type, K2)2 -176.50 24.33 -7.255 1.01e-05 ***
## C(type, K2)3 -113.25 24.33 -4.655 0.000556 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48.66 on 12 degrees of freedom
## Multiple R-squared: 0.9377, Adjusted R-squared: 0.9222
## F-statistic: 60.24 on 3 and 12 DF, p-value: 1.662e-07
```

And now we can use `confint()` to get the confidence intervals.

`confint(m2)`

```
## 2.5 % 97.5 %
## (Intercept) 563.6202 616.62977
## C(type, K2)1 197.9905 304.00954
## C(type, K2)2 -229.5095 -123.49046
## C(type, K2)3 -166.2595 -60.24046
```

These are narrower than what we got with {multcomp}. That’s because {multcomp} uses a family-wise confidence interval that adjusts for making multiple comparisons.
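To see where the difference comes from, here is a rough sketch that rebuilds both intervals for the first comparison by hand. The estimate, standard error, and family-wise quantile are copied from the output above; only the multiplier changes.

```r
est <- 251.00   # estimate for the first comparison
se  <- 24.33    # standard error from summary(m2)

# Unadjusted multiplier: t quantile on 12 residual degrees of freedom
q_unadj <- qt(0.975, df = 12)
est + c(-1, 1) * q_unadj * se

# Family-wise multiplier reported by {multcomp} ("Quantile = 2.7484")
q_fw <- 2.7484
est + c(-1, 1) * q_fw * se
```

The larger multiplier is what widens the {multcomp} intervals.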

Another property of orthogonal contrasts is that their estimates are uncorrelated. We can see this by calling `vcov()` on the fitted model object that directly uses the contrast matrix we created. The off-diagonal covariances are all zero.

`zapsmall(vcov(m2))`

```
## (Intercept) C(type, K2)1 C(type, K2)2 C(type, K2)3
## (Intercept) 147.9818 0.0000 0.0000 0.0000
## C(type, K2)1 0.0000 591.9271 0.0000 0.0000
## C(type, K2)2 0.0000 0.0000 591.9271 0.0000
## C(type, K2)3 0.0000 0.0000 0.0000 591.9271
```

We can also use the {emmeans} package to make these comparisons as follows using the original model.

```
library(emmeans)
emm_out <- emmeans(m, "type")
contrast(emm_out, list(c1 = c(1, 1, -1, -1)/2,
c2 = c(1, -1, 1, -1)/2,
c3 = c(1, -1, -1, 1)/2)) |>
confint()
```

```
## contrast estimate SE df lower.CL upper.CL
## c1 251 24.3 12 198 304.0
## c2 -176 24.3 12 -230 -123.5
## c3 -113 24.3 12 -166 -60.2
##
## Confidence level used: 0.95
```

**Variable names**

Variable names cannot begin with a digit or underscore, and if they begin with a period, the period cannot be followed by a digit. But we can bend these rules by quoting the names with backticks.

```
`_evil` <- "probably not wise"
`_evil`
```

`## [1] "probably not wise"`

```
`.666_number of the beast` <- sqrt(666^2)
`.666_number of the beast`
```

`## [1] 666`

`` rm(`_evil`, `.666_number of the beast`) ``

**Attributes**

Attributes can be attached to any R object except NULL. They can be useful for storing metadata among many other things. For example, add a source for a dataset.

```
d <- VADeaths
attr(d, "source") <- "Molyneaux, L., Gilliam, S. K., and Florant, L. C.(1947) Differences in Virginia death rates by color, sex, age, and rural or urban residence. American Sociological Review, 12, 525–535."
```

To see the source:

`attr(d, "source")`

`## [1] "Molyneaux, L., Gilliam, S. K., and Florant, L. C.(1947) Differences in Virginia death rates by color, sex, age, and rural or urban residence. American Sociological Review, 12, 525–535."`

To see all attributes of an object:

`attributes(d)`

```
## $dim
## [1] 5 4
##
## $dimnames
## $dimnames[[1]]
## [1] "50-54" "55-59" "60-64" "65-69" "70-74"
##
## $dimnames[[2]]
## [1] "Rural Male" "Rural Female" "Urban Male" "Urban Female"
##
##
## $source
## [1] "Molyneaux, L., Gilliam, S. K., and Florant, L. C.(1947) Differences in Virginia death rates by color, sex, age, and rural or urban residence. American Sociological Review, 12, 525–535."
```

To remove an attribute:

`attr(d, "source") <- NULL`

Not all attributes are displayed when called on an object. For example, after fitting a linear model, it appears there are only two attributes.

```
m <- lm(dist ~ speed, data = cars)
attributes(m)
```

```
## $names
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
##
## $class
## [1] "lm"
```

However, elements of the model object also have attributes. For example, the terms element has 10 attributes.

```
out <- attributes(m$terms)
length(out)
```

`## [1] 10`

`names(out)`

```
## [1] "variables" "factors" "term.labels" "order" "intercept"
## [6] "response" "class" ".Environment" "predvars" "dataClasses"
```

`attr(m$terms, "factors")`

```
## speed
## dist 0
## speed 1
```

**The colon operator**

I often forget the colon operator can work with decimal values.

`2.5:10.5`

`## [1] 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5`

And can go backwards:

`10.2:1.2`

`## [1] 10.2 9.2 8.2 7.2 6.2 5.2 4.2 3.2 2.2 1.2`
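A small sketch of two details that are easy to miss: the sequence steps by 1 from the starting value and stops before passing the end value, and the result is typically stored as an integer vector only when the starting value is a whole number.

```r
# Steps by 1 from 1.5 and stops at 3.5, since 4.5 would pass 4
1.5:4

# Same values as seq() with by = 1
all.equal(1.5:4, seq(1.5, 4, by = 1))

# Whole-number start gives an integer vector; a decimal start gives doubles
is.integer(2:5)
is.integer(2.5:5.5)
```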

**zero length vectors**

The sum of a zero-length vector is 0, but the product of a zero-length vector is 1.

```
x <- numeric()
length(x)
```

`## [1] 0`

`sum(x)`

`## [1] 0`

`prod(x)`

`## [1] 1`

This ensures expected behavior when working with sums and products:

```
# 12 + 0
sum(12, x)
```

`## [1] 12`

```
# 12 * 1
prod(12, x)
```

`## [1] 12`
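The same identity-element logic carries over to logical vectors. A minimal sketch, where `z` is just an illustrative name:

```r
z <- logical()

# all() of nothing is TRUE (the identity for AND)
all(z)

# any() of nothing is FALSE (the identity for OR)
any(z)

# so combining with an empty vector leaves the result unchanged
all(TRUE, z)
any(FALSE, z)
```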

**.Machine**

The `.Machine` variable holds information about the numerical characteristics of your machine. For example, the largest integer my machine can represent:

`.Machine$integer.max`

`## [1] 2147483647`

If I add 1 to that, the result is numeric, not an integer.

```
x <- .Machine$integer.max
x2 <- x + 1
is.integer(x2)
```

`## [1] FALSE`

If I add 1L (an explicit integer) to that, the result is a warning and an NA. My machine cannot represent that integer.

`x2 <- x + 1L`

`## Warning in x + 1L: NAs produced by integer overflow`

`x2`

`## [1] NA`

**Recoding factors**

There are several convenience functions in other packages for recoding variables such as `recode` in the {car} package, `case_when` in {dplyr}, and a bunch of functions in the {forcats} package. But it’s good to remember how to use base R to recode factors. Create a list with the recoding definitions and assign to the levels of the factor.

```
g <- sample(letters[1:5], 30, replace = TRUE)
g <- factor(g)
g
```

```
## [1] e c d c d c b a e c a d e e d b e c b c b c b d d c b a d e
## Levels: a b c d e
```

Put “a” and “b” into one group, “c” and “d” into another group, and keep “e” in its own group.

```
lst <- list("A" = c("a", "b"),
"B" = c("c", "d"),
"C" = "e")
levels(g) <- lst
g
```

```
## [1] C B B B B B A A C B A B C C B A C B A B A B A B B B A A B C
## Levels: A B C
```

If we like we can add an attribute to store the definition.

```
attr(g, "recoding") <- c("A = {ab}, B = {cd}, C = {e}")
g
```

```
## [1] C B B B B B A A C B A B C C B A C B A B A B A B B B A A B C
## attr(,"recoding")
## [1] A = {ab}, B = {cd}, C = {e}
## Levels: A B C
```

**lists can have dimensions**

Something more interesting than useful is that lists can have dimensions.

```
M <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
Xsq <- chisq.test(M) # produces 9 element list
Xsq <- unclass(Xsq) # remove htest class
dim(Xsq) <- c(3,3)
Xsq
```

```
## [,1] [,2] [,3]
## [1,] 30.07015 "Pearson's Chi-squared test" numeric,6
## [2,] 2 "M" table,6
## [3,] 2.953589e-07 table,6 table,6
```

`Xsq[1,3]`

```
## [[1]]
## A B C
## A 703.6714 319.6453 533.6834
## B 542.3286 246.3547 411.3166
```
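Here is a smaller sketch of the same idea using a made-up four-element list (`L` is just an illustrative name). Elements are laid out column by column, and `[[` with two indices pulls out a single element.

```r
L <- list(1, "a", TRUE, 1:3)
dim(L) <- c(2, 2)

is.list(L)    # still a list
is.matrix(L)  # and also a matrix, because it has two dimensions

# Filled column-wise, so row 2, column 2 is the fourth element
L[[2, 2]]

# Single brackets keep the list wrapper
L[2, 2]
```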

**Environments**

We are not restricted to creating objects in the Global Environment. We can create our own environments using the `new.env()` function and then create objects in that environment. We can use the dollar sign operator or the `assign()` function.

```
e1 <- new.env()
e1$mod <- lm(dist ~ speed, data = cars)
e1$cumTotal <- function(x)tail(cumsum(x), n = 1)
assign("vals", c(20, 23, 34, 19), envir = e1)
ls(e1)
```

`## [1] "cumTotal" "mod" "vals"`

`ls() # list objects in Global Environment`

`## [1] "d" "e1" "g" "lst" "m" "M" "out" "x" "x2" "Xsq"`

We can access objects in our environment using the dollar sign operator or the `get()` and `mget()` functions.

`e1$cumTotal(c(2,4,6))`

`## [1] 12`

`get("vals", envir = e1)`

`## [1] 20 23 34 19`

`mget(c("mod", "vals"), envir = e1) # get more than one object`

```
## $mod
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Coefficients:
## (Intercept) speed
## -17.579 3.932
##
##
## $vals
## [1] 20 23 34 19
```

We can save the environment and reload it in a future session.

```
save(e1, file = "e1.Rdata")
rm(e1)
load(file = "e1.Rdata")
```

We can also change the environment associated with an object that was created in the Global Environment.

```
f <- function(x)(vals + 1000) # vals object defined in e1 environment
environment(f) <- e1
f
```

```
## function(x)(vals + 1000)
## <environment: 0x000001b5ec98d4e0>
```

`f()`

`## [1] 1020 1023 1034 1019`

Notice that if we remove the environment using `rm()`, the function still remains in that environment and we still have access to its objects.

```
rm(e1)
f
```

```
## function(x)(vals + 1000)
## <environment: 0x000001b5ec98d4e0>
```

`f()`

`## [1] 1020 1023 1034 1019`

`rm(e1)` simply removes the *binding* between the symbol “e1” and the structure that contains the objects. Since the environment can still be reached as the environment of `f()`, it remains available.
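A self-contained sketch of the same behavior, using throwaway names (`e_demo`, `f_demo`) so it doesn't collide with the objects above:

```r
e_demo <- new.env()
assign("vals", c(20, 23, 34, 19), envir = e_demo)

f_demo <- function() vals + 1000
environment(f_demo) <- e_demo

rm(e_demo)  # removes the name "e_demo", not the environment itself

# The environment is still reachable through the function
ls(environment(f_demo))
f_demo()
```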

**Brackets and Dollar Signs**

I found this sentence enlightening: “One way of describing the behavior of the single bracket operator is that the type of the return value matches the type of the value it is applied to.” (p. 28) I prefer this to metaphors involving trains.

```
lst <- list(a1 = 1:5, b = c("d", "g"), c = 99)
lst["a1"] # returns a list
```

```
## $a1
## [1] 1 2 3 4 5
```

`[[` and `$` extract single values.

`lst[["a1"]]`

`## [1] 1 2 3 4 5`

`lst$a1`

`## [1] 1 2 3 4 5`

The `$` operator supports partial matching.

`lst$a`

`## [1] 1 2 3 4 5`

The `[` and `[[` operators support expressions, but not partial matching.

```
ans <- "c"
lst[ans]
```

```
## $c
## [1] 99
```

`lst[[ans]]`

`## [1] 99`
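One hedge on the partial matching point: as far as I can tell from the `?Extract` help page, `[[` takes an `exact` argument, and setting it to `FALSE` allows partial matching on names. A minimal sketch with the same list:

```r
lst <- list(a1 = 1:5, b = c("d", "g"), c = 99)

# Default is exact matching, so no element named "a" is found
lst[["a"]]

# With exact = FALSE, "a" partially matches "a1"
lst[["a", exact = FALSE]]
```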

If names are duplicated in named vectors, then only the value corresponding to the first one is returned when subsetting with brackets.

```
x <- c("a" = 1, "a" = 2)
x["a"]
```

```
## a
## 1
```

The `%in%` operator can be useful to get all elements with the same name.

`x[names(x) %in% "a"]`

```
## a a
## 1 2
```

**Matrix indexing**

I don’t work with arrays that often, but when I do I often forget that I can index them with a matrix. Below I extract the value in row 1, column 4, from each of the 3 layers of the iris3 array.

```
m <- matrix(c(1,4,1,
1,4,2,
1,4,3),
ncol = 3, byrow = TRUE)
iris3[m]
```

`## [1] 0.2 1.4 2.5`

Of course we can get the same result (in this case) using subsetting indices.

`iris3[1,4,]`

```
## Setosa Versicolor Virginica
## 0.2 1.4 2.5
```

**Negative subscripts**

Negative subscripts can appear on the *left side* of assignment.

```
x <- 1:10
x[-(2:4)] <- 99
x
```

`## [1] 99 2 3 4 99 99 99 99 99 99`

**Subsetting without dimensions**

Use a pair of empty brackets (`x[]`) to select all elements without changing any attributes.

```
x <- matrix(10:1, ncol = 2)
x
```

```
## [,1] [,2]
## [1,] 10 5
## [2,] 9 4
## [3,] 8 3
## [4,] 7 2
## [5,] 6 1
```

```
x[] <- sort(x)
x
```

```
## [,1] [,2]
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10
```
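For contrast, a quick sketch of what happens without the empty brackets (`y` and `z` are illustrative names). Assigning the sorted values directly replaces the whole object, so the dimensions are lost.

```r
y <- matrix(10:1, ncol = 2)
z <- y

y[] <- sort(y)   # fills the existing matrix; dim is kept
z   <- sort(z)   # replaces the matrix with a plain vector

dim(y)
dim(z)
```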

`lm()` and `glm()`. It’s not that {rms} is hard to use. In fact it’s quite easy to use. But using functions such as `summary()` and `anova()` with {rms} models produces very different output than what you get with a base R `lm()` model and can seem baffling to the uninitiated. In this post I hope to explain a little of what {rms} is doing by comparing it to the more traditional approaches in R so commonly taught in classrooms and textbooks.

To begin, let’s load the `gala` data from the {faraway} package. The `gala` data set contains data on species diversity on the Galapagos Islands. Below we use the base R `lm()` function to model the number of plant *Species* found on the island as a function of the *Area* of the island (km\(^2\)), the highest *Elevation* of the island (m), the distance from the *Nearest* island (km), the distance from Santa Cruz island (*Scruz*, km), and the area of the *Adjacent* island (km\(^2\)). This is how I and thousands of others learned to do regression in R. Of course we use the familiar `summary()` function on our model object to see the model coefficients, marginal tests, the residual standard error, R squared, etc.

```
library(faraway)
data("gala")
m <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent,
data = gala)
summary(m)
```

```
##
## Call:
## lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent,
## data = gala)
##
## Residuals:
## Min 1Q Median 3Q Max
## -111.679 -34.898 -7.862 33.460 182.584
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.068221 19.154198 0.369 0.715351
## Area -0.023938 0.022422 -1.068 0.296318
## Elevation 0.319465 0.053663 5.953 3.82e-06 ***
## Nearest 0.009144 1.054136 0.009 0.993151
## Scruz -0.240524 0.215402 -1.117 0.275208
## Adjacent -0.074805 0.017700 -4.226 0.000297 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.98 on 24 degrees of freedom
## Multiple R-squared: 0.7658, Adjusted R-squared: 0.7171
## F-statistic: 15.7 on 5 and 24 DF, p-value: 6.838e-07
```

Now let’s fit the same model using the {rms} package. For this we use the `ols()` function. Notice we can use R’s formula syntax as usual. One additional argument we need to use that will come in handy in a few moments is `x = TRUE`. This stores the predictor variables with the model fit as a design matrix. Notice we don’t need to call `summary()` on the saved model object. We simply print it.

```
library(rms)
mr <- ols(Species ~ Area + Elevation + Nearest + Scruz + Adjacent,
data = gala, x = TRUE)
mr
```

```
## Linear Regression Model
##
## ols(formula = Species ~ Area + Elevation + Nearest + Scruz +
## Adjacent, data = gala, x = TRUE)
##
## Model Likelihood Discrimination
## Ratio Test Indexes
## Obs 30 LR chi2 43.55 R2 0.766
## sigma60.9752 d.f. 5 R2 adj 0.717
## d.f. 24 Pr(> chi2) 0.0000 g 105.768
##
## Residuals
##
## Min 1Q Median 3Q Max
## -111.679 -34.898 -7.862 33.460 182.584
##
##
## Coef S.E. t Pr(>|t|)
## Intercept 7.0682 19.1542 0.37 0.7154
## Area -0.0239 0.0224 -1.07 0.2963
## Elevation 0.3195 0.0537 5.95 <0.0001
## Nearest 0.0091 1.0541 0.01 0.9932
## Scruz -0.2405 0.2154 -1.12 0.2752
## Adjacent -0.0748 0.0177 -4.23 0.0003
```

The coefficient tables and summaries of residuals are the same. Likewise both return R-squared and Adjusted R-squared. The residual standard error from the `lm()` output is called `sigma` in the `ols()` output.

The `ols` output also includes a “g” statistic. This is Gini’s mean difference and measures the dispersion of predicted values. It’s an alternative to standard deviation. We can use the {Hmisc} function `GiniMd` to calculate Gini’s mean difference for the model fit with `lm()` as follows.

`Hmisc::GiniMd(predict(m))`

`## [1] 105.7677`
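Gini’s mean difference is the average absolute difference between all pairs of values. A minimal base R sketch, where `gini_md()` is a hypothetical helper written for illustration rather than the {Hmisc} implementation:

```r
# Mean absolute difference over all ordered pairs i != j
gini_md <- function(x) {
  n <- length(x)
  sum(abs(outer(x, x, "-"))) / (n * (n - 1))
}

# Pairwise differences are 3, 8, and 5, so the mean is 16/3
gini_md(c(1, 4, 9))
```

Applied to `predict(m)`, this should agree with `Hmisc::GiniMd()` up to floating-point error.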

Whereas the summary output for `lm()` includes an F test for the null that all of the predictor coefficients equal 0, the `ols()` function reports a Likelihood Ratio Test. This tests the same null hypothesis using a ratio of likelihoods. We can calculate this test for the `lm()` model using the `logLik()` function. First we subtract the log likelihood of a model with no predictors from the log likelihood for the original model, and then multiply by 2. Apparently, according to Wikipedia, multiplying by 2 ensures mathematically that the test statistic converges to a chi-square distribution if the null is true. If the null is true, we expect this difference of log likelihoods to be close to 0 (or, by the quotient rule of logs, the ratio of likelihoods to be close to 1). This statistic is very large. Clearly at least one of the predictor coefficients is not 0.

```
LRchi2 <- (logLik(m) - logLik(lm(gala$Species ~ 1)))*2
LRchi2
```

`## 'log Lik.' 43.55341 (df=7)`

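If we wanted to calculate the p-value that appears in the `ols()` output, we can use the `pchisq()` function. Obviously the p-value in the `ols()` output is rounded. Note that the `ols()` output lists 5 d.f. for this test (one per predictor), whereas the `df=7` attached to the `logLik` result counts every estimated parameter in the full model, so the value computed below with `df = 7` is, if anything, larger than the p-value `ols()` is rounding.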

`pchisq(LRchi2, df = 7, lower.tail = FALSE)`

`## 'log Lik.' 2.60758e-07 (df=7)`
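Here is a sketch of the same calculation on the built-in `cars` data, so it runs without any extra packages. I'm taking the test's degrees of freedom as the difference in the number of estimated parameters between the two models, which is the usual convention for a likelihood ratio test; the object names are illustrative.

```r
full <- lm(dist ~ speed, data = cars)
null <- lm(dist ~ 1, data = cars)

# Twice the difference in log likelihoods
lr <- as.numeric(2 * (logLik(full) - logLik(null)))

# Difference in number of estimated parameters (here, one slope)
lr_df <- attr(logLik(full), "df") - attr(logLik(null), "df")

pchisq(lr, df = lr_df, lower.tail = FALSE)

# For Gaussian linear models the statistic reduces to n * log(RSS0 / RSS1)
all.equal(lr, nrow(cars) * log(deviance(null) / deviance(full)))
```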

Now let’s use the `anova()` function on both model objects and compare the output. For the `lm()` object, we get sequential partial F tests using Type I sums of squares. Each line tests the null hypothesis that adding the listed predictor to the previous model without it explains no additional variability. So the Area line compares a model with just an intercept to a model with an intercept and Area. The Elevation line compares a model with an intercept and Area to a model with an intercept, Area, and Elevation. And so on. Small p-values are evidence against the null.

`anova(m)`

```
## Analysis of Variance Table
##
## Response: Species
## Df Sum Sq Mean Sq F value Pr(>F)
## Area 1 145470 145470 39.1262 1.826e-06 ***
## Elevation 1 65664 65664 17.6613 0.0003155 ***
## Nearest 1 29 29 0.0079 0.9300674
## Scruz 1 14280 14280 3.8408 0.0617324 .
## Adjacent 1 66406 66406 17.8609 0.0002971 ***
## Residuals 24 89231 3718
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Calling `anova()` on the `ols()` object produces a series of partial F tests using Type II sums of squares. Each line tests the null hypothesis that adding the listed predictor to a model with all the other listed predictors already in it explains no additional variability. So the Area line compares a model with all the other predictors to a model with all the other predictors and Area. The Elevation line compares a model with all the other predictors to a model with all the other predictors and Elevation. The line labeled REGRESSION is the F Test reported in the summary output for the `lm()` model.

`anova(mr) `

```
## Analysis of Variance Response: Species
##
## Factor d.f. Partial SS MS F P
## Area 1 4.237718e+03 4.237718e+03 1.14 0.2963
## Elevation 1 1.317666e+05 1.317666e+05 35.44 <.0001
## Nearest 1 2.797576e-01 2.797576e-01 0.00 0.9932
## Scruz 1 4.635787e+03 4.635787e+03 1.25 0.2752
## Adjacent 1 6.640639e+04 6.640639e+04 17.86 0.0003
## REGRESSION 5 2.918500e+05 5.837000e+04 15.70 <.0001
## ERROR 24 8.923137e+04 3.717974e+03
```

To run these same Partial F tests for the `lm()` object we can use the base R `drop1()` function.

`drop1(m, test = "F")`

```
## Single term deletions
##
## Model:
## Species ~ Area + Elevation + Nearest + Scruz + Adjacent
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 89231 251.93
## Area 1 4238 93469 251.33 1.1398 0.2963180
## Elevation 1 131767 220998 277.14 35.4404 3.823e-06 ***
## Nearest 1 0 89232 249.93 0.0001 0.9931506
## Scruz 1 4636 93867 251.45 1.2469 0.2752082
## Adjacent 1 66406 155638 266.62 17.8609 0.0002971 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
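Because each predictor here uses a single degree of freedom, each partial F statistic is just the square of the corresponding t statistic in the coefficient table (for example, 5.953 squared is about 35.44 for Elevation). A small sketch on the built-in `mtcars` data, with illustrative object names:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# t statistics for the predictors (drop the intercept row)
t_vals <- coef(summary(fit))[-1, "t value"]

# Partial F statistics from single-term deletions (drop the <none> row)
f_vals <- drop1(fit, test = "F")[-1, "F value"]

all.equal(unname(t_vals^2), f_vals)
```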

The {rms} package offers a convenient plot method for `anova()` objects that “draws dot charts depicting the importance of variables in the model” (from the `anova.rms` help page). The default importance measure is the chi-square statistic for each factor minus its degrees of freedom. There are several other importance measures available. See the `anova.rms` help page for the `what` argument.

`plot(anova(mr))`

Earlier we noted that we don’t use the `summary()` function on `ols()` model objects to see a model summary. That doesn’t mean there isn’t a `summary()` method available. There is, but it does something entirely different than the summary method for `lm()` objects. If we try it right now we get an error message:

`summary(mr)`

`## Error in summary.rms(mr): adjustment values not defined here or with datadist for Area Elevation Nearest Scruz Adjacent`

Notice what it says: “adjustment values not defined here or with datadist”. This tells us we need to define “adjustment values” to use the `summary()` function. Instead of summarizing the model, the {rms} summary function when called on {rms} model objects returns a “summary of the effects of each factor”. Let’s demonstrate what this means.

The easiest way to define adjustment values is to use the `datadist()` function on our data frame and then set the global datadist option using the base R `options()` function. Harrell frequently assigns datadist results to “d” so we do the same. (Also, the help page for datadist states, “The best method is probably to run datadist once before any models are fitted, storing the distribution summaries for all potential variables.” I elected to wait for presentation purposes.)

```
d <- datadist(gala)
options(datadist = "d")
```

If we print “d” we see adjustment levels for all variables in the model.

`d`

```
## Species Endemics Area Elevation Nearest Scruz Adjacent
## Low:effect 13.0 7.25 0.2575 97.75 0.800 11.025 0.5200
## Adjust to 42.0 18.00 2.5900 192.00 3.050 46.650 2.5900
## High:effect 96.0 32.25 59.2375 435.25 10.025 81.075 59.2375
## Low:prediction 2.0 1.45 0.0390 47.35 0.445 0.490 0.1000
## High:prediction 319.1 85.40 782.6215 1229.40 40.205 193.905 782.6215
## Low 2.0 0.00 0.0100 25.00 0.200 0.000 0.0300
## High 444.0 95.00 4669.3200 1707.00 47.400 290.200 4669.3200
```

Now let’s call `summary()` on our {rms} model object and see what it returns.

`summary(mr)`

```
## Effects Response : Species
##
## Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
## Area 0.2575 59.238 58.980 -1.411900 1.3225 -4.1413 1.3176
## Elevation 97.7500 435.250 337.500 107.820000 18.1110 70.4400 145.2000
## Nearest 0.8000 10.025 9.225 0.084353 9.7244 -19.9860 20.1550
## Scruz 11.0250 81.075 70.050 -16.849000 15.0890 -47.9910 14.2930
## Adjacent 0.5200 59.238 58.718 -4.392400 1.0393 -6.5374 -2.2473
```

This produces an interquartile range “Effects” summary for all predictors. For example, in the first row we see the effect of Area is about -1.41. To calculate this we predict Species when Area = 59.238 (High, 75th percentile) and when Area = 0.2575 (Low, 25th percentile) and take the difference in predicted species. For each prediction, all other variables are held at their “Adjust to” level as shown above when we printed the datadist object. In addition to the effect, the standard error (SE) of the effect and a 95% confidence interval on the effect are returned. In this case we’re not sure if the effect is positive or negative.

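To calculate this effect measure using our `lm()` model object we can use the `predict()` function to make two predictions and then take the difference using the `diff()` function. Notice all the values in the newdata argument come from the datadist object above. The two values for Area are the “Low:effect” and “High:effect” values. Nearest, Scruz, and Adjacent are set to their “Adjust to” values, while Elevation is set to its “Low:effect” value of 97.75. Because the model is additive, the difference between the two predictions does not depend on where the other predictors are held.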

```
# How to get Area effect in summary(mr) output = -1.411900
p <- predict(m, newdata = data.frame(Area = c(0.2575, 59.238),
Elevation=97.75,
Nearest=3.05,
Scruz=46.65,
Adjacent=2.59))
p
```

```
## 1 2
## 26.90343 25.49153
```

`diff(p)`

```
## 2
## -1.411895
```

To get the standard error and confidence interval we can do the following. (Recall 58.980 is the IQR of Area.)

```
se <- sqrt((58.980 * vcov(m)["Area","Area"] * 58.980))
se
```

`## [1] 1.32247`

`diff(p) + c(-1,1) * se * qt(0.975, df = 24)`

`## [1] -4.141340 1.317549`

This is made a little easier using the `emmeans()` function in the {emmeans} package. We use the `at` argument to specify at what values we wish to make predictions for Area. All other values are held at their median by setting `cov.reduce = median`. Sending the result to the {emmeans} function `contrast` with argument “revpairwise” says to subtract the estimated means in reverse order. Finally we pipe into `confint()` to replicate the IQR effect estimate for Area that we saw in the `ols()` summary.

```
library(emmeans)
emmeans(m, specs = "Area", at = list(Area = c(0.2575, 59.238)),
cov.reduce = median) |>
contrast("revpairwise") |>
confint()
```

```
## contrast estimate SE df lower.CL upper.CL
## Area59.238 - Area0.2575 -1.41 1.32 24 -4.14 1.32
##
## Confidence level used: 0.95
```

There is also a plot method for {rms} summary objects. It plots the Effects and the 90, 95, and 99 percent confidence intervals using different shades of blue. Below we see the Area effect of -1.41 with relatively tight confidence intervals hovering around 0.

`plot(summary(mr))`

Harrell advocates using non-linear effects in the form of regression splines. This makes a lot of sense when you pause and consider how many effects in real life are truly linear. Not many. Very few associations in nature indefinitely follow a straight line relationship. Fortunately, we can easily implement regression splines in R and specify how much non-linearity we want to entertain. We do this either in the form of degrees of freedom or knots. More of either means more non-linearity. I personally think of it as the number of times the relationship might change direction.

One way to implement regression splines in R is via the `ns()` function in the {splines} package, which comes installed with R. Below we fit a model that allows the effect of Nearest to change directions three times by specifying `df=3`. We might think of this as sort of like using a 3-degree polynomial to model Nearest (but that’s not what we’re doing). The summary shows three coefficients for Nearest, none of which has a direct interpretation.

```
library(splines)
m2 <- lm(Species ~ Area + Elevation + ns(Nearest, df = 3) +
Scruz + Adjacent,
data = gala)
summary(m2)
```

```
##
## Call:
## lm(formula = Species ~ Area + Elevation + ns(Nearest, df = 3) +
## Scruz + Adjacent, data = gala)
##
## Residuals:
## Min 1Q Median 3Q Max
## -79.857 -28.285 -0.775 25.498 163.069
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.18825 27.39386 0.481 0.6350
## Area -0.03871 0.02098 -1.845 0.0785 .
## Elevation 0.35011 0.05086 6.883 6.51e-07 ***
## ns(Nearest, df = 3)1 -168.44016 68.41261 -2.462 0.0221 *
## ns(Nearest, df = 3)2 -56.97623 57.67426 -0.988 0.3339
## ns(Nearest, df = 3)3 38.61120 46.22137 0.835 0.4125
## Scruz -0.13735 0.19900 -0.690 0.4973
## Adjacent -0.08254 0.01743 -4.734 0.0001 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 55.11 on 22 degrees of freedom
## Multiple R-squared: 0.8247, Adjusted R-squared: 0.7689
## F-statistic: 14.78 on 7 and 22 DF, p-value: 5.403e-07
```

To decide whether we should keep this non-linear effect, we could use the `anova()` function to compare this updated more complex model to the original model with only linear effects. The null of the test is that both models are equally adequate. It appears there is some evidence that this non-linearity improves the model.

`anova(m, m2)`

```
## Analysis of Variance Table
##
## Model 1: Species ~ Area + Elevation + Nearest + Scruz + Adjacent
## Model 2: Species ~ Area + Elevation + ns(Nearest, df = 3) + Scruz + Adjacent
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 24 89231
## 2 22 66808 2 22424 3.6921 0.04144 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

To entertain non-linear effects in {rms} use the `rcs()` function, which stands for restricted cubic splines. Instead of degrees of freedom we specify knots. Three degrees of freedom corresponds to four knots.

```
mr2 <- ols(Species ~ Area + Elevation + rcs(Nearest, 4) +
Scruz + Adjacent,
data = gala, x = TRUE)
mr2
```

```
## Linear Regression Model
##
## ols(formula = Species ~ Area + Elevation + rcs(Nearest, 4) +
## Scruz + Adjacent, data = gala, x = TRUE)
##
## Model Likelihood Discrimination
## Ratio Test Indexes
## Obs 30 LR chi2 51.32 R2 0.819
## sigma55.9556 d.f. 7 R2 adj 0.762
## d.f. 22 Pr(> chi2) 0.0000 g 107.069
##
## Residuals
##
## Min 1Q Median 3Q Max
## -90.884 -29.870 -3.715 28.019 163.497
##
##
## Coef S.E. t Pr(>|t|)
## Intercept 14.1147 30.2230 0.47 0.6451
## Area -0.0368 0.0212 -1.74 0.0964
## Elevation 0.3444 0.0512 6.73 <0.0001
## Nearest 3.6081 16.9795 0.21 0.8337
## Nearest' -714.5143 891.8922 -0.80 0.4316
## Nearest'' 1001.4929 1207.4895 0.83 0.4158
## Scruz -0.2286 0.1978 -1.16 0.2602
## Adjacent -0.0806 0.0175 -4.60 0.0001
```

Notice we get three coefficients for Nearest that differ in value from our `lm()` result. A brief explanation of why that is can be found on datamethods.org. This paper provides a deeper explanation.

Calling the {rms} `anova()` method on the model object returns a test for nonlinearity under the Nearest row, labeled “Nonlinear”. Notice the resulting p-value is slightly higher than the one from the `ns()` comparison and exceeds 0.05.

`anova(mr2)`

```
## Analysis of Variance Response: Species
##
## Factor d.f. Partial SS MS F P
## Area 1 9446.295 9446.295 3.02 0.0964
## Elevation 1 141872.398 141872.398 45.31 <.0001
## Nearest 3 20348.931 6782.977 2.17 0.1208
## Nonlinear 2 20348.651 10174.326 3.25 0.0580
## Scruz 1 4182.177 4182.177 1.34 0.2602
## Adjacent 1 66204.433 66204.433 21.14 0.0001
## REGRESSION 7 312198.651 44599.807 14.24 <.0001
## ERROR 22 68882.715 3131.033
```

To replicate the {rms} `anova()` output using `lm()`, we need to change the `ns()` arguments. The following R code comes from this hbiostat.org page.

```
w <- rcs(gala$Nearest, 4)
kn <- attr(w, 'parms')
m2 <- lm(Species ~ Area + Elevation + ns(Nearest, knots = kn[2:3],
Boundary.knots = c(kn[1], kn[4]))
+ Scruz + Adjacent, data = gala)
anova(m, m2)
```

```
## Analysis of Variance Table
##
## Model 1: Species ~ Area + Elevation + Nearest + Scruz + Adjacent
## Model 2: Species ~ Area + Elevation + ns(Nearest, knots = kn[2:3], Boundary.knots = c(kn[1],
## kn[4])) + Scruz + Adjacent
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 24 89231
## 2 22 68883 2 20349 3.2495 0.05801 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Which approach is better: `ns()` or `rcs()`? I don’t pretend to know and I’m not sure it really matters. The substance of the results doesn’t differ much. Yes, we saw a p-value jump from about 0.04 to about 0.06, but we know better than to make hard binary decisions based on a p-value.

Since interpreting non-linear effect coefficients is all but impossible, we turn to visualization. When fitting a non-linear model with `lm()`, the {effects} and {ggeffects} packages are fantastic for creating effect plots.

```
m2 <- lm(Species ~ Area + Elevation + ns(Nearest, df = 3) +
Scruz + Adjacent, gala)
library(effects)
Effect("Nearest", m2) |> plot()
```

```
library(ggeffects)
ggeffect(m2, "Nearest") |> plot()
```

In both cases we see the effect of Nearest is negative for values 0 – 20, but then seems to increase for values above 20.

To get a similar plot for an {rms} model, we can use the `Predict()` function (note the capital “P”; this is an {rms} function) and then pipe into `plot()`. You can also pipe into `ggplot()` and `plotp()` to get ggplot2 and plotly plots, respectively.

`Predict(mr2, Nearest) |> plot() `

To get all effect plots in one go, we can use the {effects} function `allEffects()` with the `lm()` model.

`allEffects(m2) |> plot()`

To do the same with {rms}, we use the `Predict()` function with no predictors specified. Notice the y-axis has the same limits for all plots.

`Predict(mr2) |> plot()`

A linear model makes several assumptions, two of which are constant variance of residuals and normality of residuals. The `plot()` method for `lm()` objects makes this easy to assess graphically.

```
op <- par(mfrow = c(2,2))
plot(m2)
```

`par(op)`

The left two plots assess constant variance using two different types of residuals. The upper right plot assesses normality. The bottom right helps identify influential observations. The islands of Isabela and Santa Cruz stand out as either not being well fit by the model or unduly influencing the fit of the model.

We can use the `plot.lm()` method on the `ols()` model, but only for the first plot (i.e., `which = 1`).

`plot(mr2, which = 1)`

To create a QQ plot to assess normality of residuals we need to extract the residuals using the `residuals()` function and plot with `qqnorm()` and `qqline()`.

```
r <- residuals(mr2, type="student")
qqnorm(r)
qqline(r)
```

To check for possible influential observations for an `lm()` model we can use the base R function `influence.measures()` and its associated `summary()` method. See `?influence.measures` for a breakdown of what is returned. The columns that begin “dfb” are the DFBETA statistics. DFBETAS indicate the effect that deleting each observation has on the estimates of the regression coefficients. That’s why there’s a DFBETA for every coefficient in the model.

`summary(influence.measures(m2))`

```
## Potentially influential observations of
## lm(formula = Species ~ Area + Elevation + ns(Nearest, df = 3) + Scruz + Adjacent, data = gala) :
##
## dfb.1_ dfb.Area dfb.Elvt dfb.n(N,d=3)1 dfb.n(N,d=3)2 dfb.n(N,d=3)3
## Darwin -0.02 0.07 -0.08 0.05 -0.10 -0.10
## Fernandina 0.14 0.19 -0.15 0.02 0.00 0.03
## Genovesa -0.07 -0.08 0.12 0.13 -0.15 -0.43
## Isabela 0.05 -18.76_* 4.55_* -1.69_* -1.12_* 0.94
## SanCristobal -0.09 -0.11 0.18 -0.18 0.22 0.49
## SantaCruz 0.43 -1.03_* 1.58_* -0.57 -0.96 -0.38
## Wolf 0.00 0.00 0.00 0.01 0.00 0.00
## dfb.Scrz dfb.Adjc dffit cov.r cook.d hat
## Darwin 0.48 0.00 0.62 2.36_* 0.05 0.48
## Fernandina -0.07 -0.91 -1.51 27.71_* 0.30 0.95_*
## Genovesa 0.14 -0.10 -0.47 3.16_* 0.03 0.57
## Isabela -0.54 -0.78 -26.78_* 0.45 50.94_* 0.98_*
## SanCristobal -0.18 -0.12 0.58 2.56_* 0.04 0.50
## SantaCruz -0.17 -1.16_* 2.33_* 0.01 0.36 0.21
## Wolf 0.03 0.00 0.04 2.21_* 0.00 0.34
```
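To make the DFBETA idea concrete, here is a minimal sketch in base R. It uses the built-in `mtcars` data and an arbitrary model as a stand-in (an assumption for illustration, not the `gala` model above). It drops the first observation, refits, and compares the size of the change in each coefficient with what `dfbeta()` reports. `dfbetas()` returns those same changes scaled by a standard error.

```r
# Stand-in model on built-in data; not the gala model from this post
m <- lm(mpg ~ wt + hp, data = mtcars)

# Refit without the first observation and record how each coefficient moves
m_drop <- update(m, data = mtcars[-1, ])
change <- coef(m) - coef(m_drop)

# Compare magnitudes with the first row of dfbeta()
cbind(refit = abs(change), dfbeta = abs(dfbeta(m)[1, ]))

# dfbetas() has one row per observation and one column per coefficient
dim(dfbetas(m))
```

The two columns should agree, which is the sense in which a DFBETA measures the effect of deleting one observation.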

The {rms} package offers the `which.influence()` and `show.influence()` tandem to identify observations (via DFBETAS) that affect regression coefficients when removed. Below we set `cutoff = 0.4`. An asterisk is placed next to a variable when any of the coefficients associated with that variable change by more than 0.4 standard errors upon removal of the observation. Below we see that removing the Isabela observation changes four of the coefficients. (Also, this is why we set `x = TRUE` in our call to `ols()`, so we could use these functions.) The values displayed are the observed values for each observation.

```
w <- which.influence(mr2, cutoff = 0.4)
show.influence(w, gala)
```

```
## Count Area Elevation Nearest Scruz Adjacent
## Darwin 1 2.33 168 34.1 *290.2 2.85
## Fernandina 1 634.49 1494 4.3 95.3 *4669.32
## Isabela 4 *4669.32 *1707 0.7 * 28.1 * 634.49
## Pinta 3 * 59.56 * 777 29.1 119.6 * 129.49
## SanSalvador 3 * 572.33 * 906 0.2 19.8 * 4.89
## SantaCruz 3 * 903.82 * 864 0.6 0.0 * 0.52
## SantaFe 1 24.08 259 *16.5 16.5 0.52
## SantaMaria 5 * 170.92 * 640 * 2.6 49.2 * 0.10
```
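For intuition about the cutoff rule, here is a rough base R analogue. It is only a sketch: it uses the built-in `mtcars` data and an arbitrary model as a stand-in, and it flags individual coefficients, whereas `which.influence()` reports by variable. It should not be read as how {rms} implements the check.

```r
# Stand-in model on built-in data; not the ols() model from this post
m <- lm(mpg ~ wt + hp, data = mtcars)

cutoff <- 0.4

# TRUE where deleting an observation moves a coefficient by more than
# `cutoff` standard errors
flagged <- abs(dfbetas(m)) > cutoff

# Number of flagged coefficients per observation; show only flagged rows
n_flagged <- rowSums(flagged)
n_flagged[n_flagged > 0]
```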

Got a question? Harrell runs a discussion board for the {rms} package.

Chapter 14 of these course notes by Thomas Love is also helpful.

And here’s a nice blog post called An Introduction to the Harrell“verse”: Predictive Modeling using the Hmisc and rms Packages by Nicholas Ollberding.

Eventually you may want to buy Harrell’s book, Regression Modeling Strategies. It’s not cheap, but it’s not outrageous either. Considering how much information, advice, code, examples, and references it contains, I think it’s a bargain. It’s a text that will provide many months, if not years, of study.

So what would I rather do?

- Flip 3 coins and win if all match
- Roll 3 dice and win if none match

I would rather do the one with the highest probability of happening. So let’s calculate that using R.

There are two ways to solve problems like this. One is to use the appropriate probability distribution and calculate it mathematically. The other is to enumerate all possible outcomes and find the proportion of interest. Let’s do the latter first.

We’ll use the `expand.grid` function to create a data frame of all possible outcomes of flipping 3 coins. All we have to do is give it 3 vectors representing coins. It’s such a small sample space we can eyeball it and see the probability of getting all heads or all tails is 2/8 or 0.25.

```
s1 <- expand.grid(c("H", "T"), c("H", "T"), c("H", "T"))
s1
```

```
## Var1 Var2 Var3
## 1 H H H
## 2 T H H
## 3 H T H
## 4 T T H
## 5 H H T
## 6 T H T
## 7 H T T
## 8 T T T
```

If we wanted to use R to determine the proportion, we could use `apply` to apply a function to each row and return the number of unique values, and then find the proportion of `1`s. (`1` means there was only one unique value: all “H” or all “T”.)

```
count1 <- apply(s1, 1, function(x)length(unique(x)))
mean(count1 == 1)
```

`## [1] 0.25`

We can do the same with rolling 3 dice. Give `expand.grid` 3 dice and have it generate all possible results. This will generate 216 possibilities, so there’s no eyeballing this for an answer. As before, we’ll apply a function to each row and determine if there are any duplicates. The `anyDuplicated` function returns the location of the first duplicate within a vector. If there are no duplicates, the result is a `0`, which means we just need to find the proportion of `0`s.

```
s2 <- expand.grid(1:6, 1:6, 1:6)
count2 <- apply(s2, 1, anyDuplicated)
mean(count2 == 0)
```

`## [1] 0.5555556`

An easier way to do that is to use permutation calculations. There are \(6^3 = 216\) possible results from rolling 3 dice. Of those, \(6 \cdot 5 \cdot 4 = 120\) are non-matching results: there are 6 possibilities for the first die, only 5 for the second, and 4 for the third. We can then divide to find the probability: \(120/216 \approx 0.556\).
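The same arithmetic takes a couple of lines in R. This is just a sketch of the counting argument; the object names are made up for illustration.

```r
n_total    <- 6^3        # all ordered outcomes of rolling 3 dice
n_distinct <- prod(6:4)  # 6 * 5 * 4 outcomes with no repeated value

n_distinct / n_total
```

This agrees with the proportion we found by enumerating all 216 rows.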

Clearly we’d rather roll the dice (assuming fair coins and fair dice).

We can also answer the question using the binomial probability distribution. The binomial distribution is appropriate when you have:

- two outcomes (heads vs tails, no match vs one or more matches)
- independent events
- same probability for each event

The `dbinom` function calculates probabilities of binomial outcomes. Below we use it to calculate the probability of 0 heads (3 tails) and 3 heads, and sum the total. The `x` argument is the number of “successes”, for example 0 heads (or tails, whatever you call a “success”). The `size` argument is the number of trials, or coins in this case. The `prob` argument is the probability of success on each trial, or for each coin in this case.

```
dbinom(x = 0, size = 3, prob = 0.5) +
dbinom(x = 3, size = 3, prob = 0.5)
```

`## [1] 0.25`

We can also frame the dice rolling as a binomial outcome, but it’s a little trickier. Think of rolling the dice one at a time:

- It doesn’t matter what we roll the first time. We don’t care if it’s a 1 or 6 or whatever. We’re certain to get something, so the probability is 1.
- The second roll is a “success” if it *does not* match the first roll. That’s one dice roll (`size = 1`) where success (`x = 1`) happens with probability 5/6.
- The third and final roll is a “success” if it doesn’t match either of the first two rolls. That’s one dice roll (`size = 1`) where success (`x = 1`) happens with probability 4/6.

Notice we don’t add these probabilities but rather *multiply* them.

```
1 * dbinom(x = 1, size = 1, prob = 5/6) *
dbinom(x = 1, size = 1, prob = 4/6)
```

`## [1] 0.5555556`

We added the coin flipping probabilities because they each represented mutually exclusive events. They cannot both occur. There is one way to get 0 successes and one way to get 3 successes. Together they sum to the probability of all matching (i.e., all heads or all tails).

We multiplied the dice rolling probabilities because they each represented events that could occur at the same time, and they were conditional if we considered each die in turn. This requires the multiplication rule of probability.
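A third way to check both answers is simulation. The sketch below plays each game many times and reports the proportion of wins. The seed and the number of simulations are arbitrary choices, so the results will only be close to 0.25 and 0.556, not exactly equal.

```r
set.seed(123)    # arbitrary seed, only for reproducibility
n_sim <- 100000  # number of simulated games

# Flip 3 coins per game; win if all three match
coins <- matrix(sample(c("H", "T"), 3 * n_sim, replace = TRUE), ncol = 3)
p_coins <- mean(coins[, 1] == coins[, 2] & coins[, 2] == coins[, 3])

# Roll 3 dice per game; win if no two dice match
dice <- matrix(sample(1:6, 3 * n_sim, replace = TRUE), ncol = 3)
p_dice <- mean(apply(dice, 1, anyDuplicated) == 0)

c(coins = p_coins, dice = p_dice)
```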

*p<0.1; p<0.05;* p<0.01

What should be displayed is this:

*p<0.1; **p<0.05; ***p<0.01

This is the legend for understanding the “stars” in the regression table. What’s happening is the asterisks are being treated as markdown code, which is adding italics and bolding to the note instead of simply showing the asterisks verbatim. (Recall that in Markdown surrounding text with one asterisk adds italics and surrounding text with two asterisks adds bolding.)

To fix this we can add the following two arguments to the stargazer function:

`notes.append = FALSE`

`notes = c("<sup>&sstarf;</sup>p<0.1; <sup>&sstarf;&sstarf;</sup>p<0.05; <sup>&sstarf;&sstarf;&sstarf;</sup>p<0.01")`

`&sstarf;` is an HTML entity for a star character (⋆). `notes.append = FALSE` says to NOT append the note but rather replace the existing note. The `notes` argument specifies the note we want to add using the `&sstarf;` HTML entity.
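Put together, a complete call might look like the sketch below. The model is a hypothetical stand-in fit to the built-in `mtcars` data, `type = "html"` is assumed, and the star is written as the named HTML entity `&sstarf;` (any entity that renders ⋆ would do). The call is wrapped in `requireNamespace()` so the code does nothing if {stargazer} is not installed.

```r
# Replacement legend, with each star written as the HTML entity &sstarf;
star_note <- "<sup>&sstarf;</sup>p<0.1; <sup>&sstarf;&sstarf;</sup>p<0.05; <sup>&sstarf;&sstarf;&sstarf;</sup>p<0.01"

# Hypothetical model, used only to have something to tabulate
fit <- lm(mpg ~ wt + hp, data = mtcars)

if (requireNamespace("stargazer", quietly = TRUE)) {
  stargazer::stargazer(fit, type = "html",
                       notes.append = FALSE,  # replace the default note
                       notes = star_note)     # use our own legend instead
}
```

In an R Markdown document the chunk would also need `results = "asis"` so the HTML is passed through rather than printed as text.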

Let’s simulate some data for such a model and then see how we can use R to carry out these tests.

```
n <- 400
set.seed(1)
sex <- factor(sample(x = c("f", "m"), size = n, replace = TRUE))
age <- round(runif(n = n, min = 18, max = 65))
y <- 1 + 0.8*age + 0.4*(sex == "m") - 0.7*age*(sex == "m") + rnorm(n, mean = 0, sd = 8)
dat <- data.frame(y, age, sex)
```

The data contain a numeric response, `y`, that is a function of `age` and `sex`. I set the “true” coefficient values to 1, 0.8, 0.4, and -0.7. They correspond to \(\beta_0\) through \(\beta_3\) in the following model:

\[y = \beta_0 + \beta_1 age + \beta_2 sex + \beta_3 age \times sex\]

In addition the error component is a Normal distribution with a standard deviation of 8.
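It helps to work out what these “true” values imply for each sex before fitting anything. The sketch below does that arithmetic, treating `sex` as an indicator that is 0 for females and 1 for males, as in the simulation code above. The object names are made up for illustration.

```r
# "True" coefficients used in the simulation above
b <- c(b0 = 1, b1 = 0.8, b2 = 0.4, b3 = -0.7)

# Plug in sex = 0 (female) and sex = 1 (male)
true_lines <- rbind(
  f = c(intercept = b[["b0"]],             slope = b[["b1"]]),
  m = c(intercept = b[["b0"]] + b[["b2"]], slope = b[["b1"]] + b[["b3"]])
)
true_lines
```

So the data were generated with an age slope of 0.8 for females and 0.1 for males.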

Now let’s model the data and see how close we get to recovering the true parameter values.

```
mod <- lm(y ~ age * sex, dat)
summary(mod)

##
## Call:
## lm(formula = y ~ age * sex, data = dat)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -23.8986  -5.8552  -0.2503   6.0507  30.6188
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.27268    1.93776   0.141    0.888
## age          0.79781    0.04316  18.484   <2e-16 ***
## sexm         2.07143    2.84931   0.727    0.468
## age:sexm    -0.72702    0.06462 -11.251   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.661 on 396 degrees of freedom
## Multiple R-squared:  0.7874, Adjusted R-squared:  0.7858
## F-statistic:   489 on 3 and 396 DF,  p-value: < 2.2e-16
```

While the coefficient estimates for age and the age \(\times\) sex interaction are pretty close to the true values, the same cannot be said for the intercept and sex coefficients. The residual standard error of 8.661 is close to the true value of 8.

We can see in the summary output of the model that four hypothesis tests, one for each coefficient, are carried out for us. Each tests whether the coefficient is equal to 0. Of those four, only one qualifies as one of the most useful tests: the last one for `age:sexm`. This tests if the effect of age is independent of sex and vice versa. Stated two other ways, it tests if age and sex are additive, or if the age effect is the same for both sexes. To get a better understanding of what we’re testing, let’s plot the data with fitted age slopes for each sex.

```
library(ggplot2)
ggplot(dat, aes(x = age, y = y, color = sex)) +
geom_point() +
geom_smooth(method="lm")
```

Visually it appears the effect of age is *not* independent of sex. It seems more pronounced for females. Is this effect real or maybe due to chance? The hypothesis test in the summary output for `age:sexm` evaluates this. The effect seems very real: we would be unlikely to see a difference in slopes this large if there truly were no difference. It does appear the effect of age is different for each sex. The coefficient of about -0.73 estimates the difference in slopes (or age effect) between males and females.

The other three hypothesis tests are not very useful.

- Testing if the `Intercept` is 0 is testing whether `y` is 0 for females at age 0.
- Testing if `age` is 0 is testing whether `age` is associated with `y` for females (the reference level of `sex`).
- Testing if `sexm` is 0 is testing whether `sex` is associated with `y` for subjects at age 0.

Other more useful tests, as Harrell outlines in Table 2.2, are as follows:

- Is `age` associated with `y`?
- Is `sex` associated with `y`?
- Is either `age` or `sex` associated with `y`?

The last one is answered in the model output. That’s the F-statistic in the last line. It tests whether all coefficients (except the intercept) are equal to 0. The result of this test is conclusive. At least one of the coefficients is not 0.
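One way to see what that F-statistic is doing is to compare the fitted model with an intercept-only model. The sketch below re-creates the simulated data so it runs on its own, then uses base R’s `anova()` for the comparison. The names `mod0` and `overall` are made up for illustration.

```r
# Re-create the simulated data from above
n <- 400
set.seed(1)
sex <- factor(sample(x = c("f", "m"), size = n, replace = TRUE))
age <- round(runif(n = n, min = 18, max = 65))
y <- 1 + 0.8*age + 0.4*(sex == "m") - 0.7*age*(sex == "m") + rnorm(n, mean = 0, sd = 8)
dat <- data.frame(y, age, sex)

mod  <- lm(y ~ age * sex, data = dat)
mod0 <- lm(y ~ 1, data = dat)  # intercept only: all other coefficients set to 0

# F test comparing the two nested models
overall <- anova(mod0, mod)
overall

# The statistic reported in the last line of summary(mod)
summary(mod)$fstatistic
```

The F value in the second row of the `anova()` table matches the one stored in the model summary.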

To test if `age` is associated with `y`, we need to test if both the `age` and `age:sexm` coefficients are equal to 0. The `car` package by John Fox provides a nice function for this purpose called `linearHypothesis`. It takes at least two arguments. The first is the fitted model object and the second is a vector of hypothesis tests. Below we specify we want to test if “age = 0” and “age:sexm = 0”.

```
library(car)
linearHypothesis(mod, c("age = 0", "age:sexm = 0"))

## Linear hypothesis test
##
## Hypothesis:
## age = 0
## age:sexm = 0
##
## Model 1: restricted model
## Model 2: y ~ age * sex
##
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)
## 1    398 55494
## 2    396 29704  2     25790 171.91 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The result is once again conclusive. The p-value is virtually 0. It does indeed appear that age is associated with `y`.

Likewise, to test if `sex` is associated with `y`, we need to test if both the `sex` and `age:sexm` coefficients are equal to 0.

```
linearHypothesis(mod, c("sexm = 0", "age:sexm = 0"))

## Linear hypothesis test
##
## Hypothesis:
## sexm = 0
## age:sexm = 0
##
## Model 1: restricted model
## Model 2: y ~ age * sex
##
##   Res.Df    RSS Df Sum of Sq     F    Pr(>F)
## 1    398 119354
## 2    396  29704  2     89651 597.6 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

As expected this test confirms that `sex` is associated with `y`, just as we specified when we simulated the data.
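The “restricted model” in both tests is just the model with the tested coefficients removed, so the same comparisons can be sketched in base R with `anova()`. Setting the `age` and `age:sexm` coefficients to 0 leaves a model with `sex` only, and setting `sexm` and `age:sexm` to 0 leaves a model with `age` only. The code re-creates the simulated data so it runs on its own; `test_age` and `test_sex` are made-up names.

```r
# Re-create the simulated data from above
n <- 400
set.seed(1)
sex <- factor(sample(x = c("f", "m"), size = n, replace = TRUE))
age <- round(runif(n = n, min = 18, max = 65))
y <- 1 + 0.8*age + 0.4*(sex == "m") - 0.7*age*(sex == "m") + rnorm(n, mean = 0, sd = 8)
dat <- data.frame(y, age, sex)

mod <- lm(y ~ age * sex, data = dat)

# Is age associated with y? Restricted model drops age and age:sex
test_age <- anova(lm(y ~ sex, data = dat), mod)

# Is sex associated with y? Restricted model drops sex and age:sex
test_sex <- anova(lm(y ~ age, data = dat), mod)

test_age
test_sex
```

Each table has 2 degrees of freedom for the test, one for each coefficient set to 0.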

Now that we have established that `age` is associated with `y`, and that the association differs for each `sex`, what exactly is that association for each sex? In other words what are the slopes of the lines in our plot above?

We can sort of answer that with the model coefficients.

```
round(coef(mod),3)

## (Intercept)         age        sexm    age:sexm
##       0.273       0.798       2.071      -0.727
```

That corresponds to the following model:

\[y = 0.273 + 0.798 age + 2.071 sex - 0.727 age \times sex\]

When `sex` is female, the fitted model is

\[y = 0.273 + 0.798 age \]

This says the slope of `age` is about 0.8 when `sex` is female.

When `sex` is male, the fitted model is

\[y = (0.273 + 2.071) + (0.798 - 0.727) age \]

\[y = 2.344 + 0.071 age \]

This says the slope of `age` is about 0.07 when `sex` is male.

How certain are we about these estimates? That’s what the standard error is for. For the age slope estimate for females, the standard error is provided in the model output for the `age` coefficient. It shows about 0.04. Adding and subtracting 2 \(\times\) 0.04 to the coefficient gives us a rough 95% confidence interval. Or we could just use the `confint` function:

```
confint(mod, parm = "age")

##         2.5 %    97.5 %
## age 0.7129564 0.8826672
```

The standard error of the age slope estimate for males takes a little more work. Another `car` function useful for this is the `deltaMethod` function. It takes at least three arguments: the model object, the quantity expressed as a character phrase that we wish to estimate a standard error for, and the names of the parameters. The function then calculates the standard error using the *delta method*. Here’s one way to do it for our model:

```
deltaMethod(mod, "b1 + b3", parameterNames = paste0("b", 0:3))

##           Estimate         SE       2.5 %    97.5 %
## b1 + b3 0.07079277 0.04808754 -0.02345709 0.1650426
```

The standard error is similar in magnitude, but since our estimate is so small the resulting confidence interval overlaps 0. This tells us the effect of age on males is too small for our data to determine if the effect is positive or negative.
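Because the male slope is a simple sum of two coefficients, its standard error can also be worked out by hand from the coefficient covariance matrix: the variance of a sum is the two variances plus twice the covariance. The sketch below does this in base R. It re-creates the simulated data so it runs on its own, and it assumes a normal-based 95% interval, which is what the interval above appears to be.

```r
# Re-create the simulated data from above
n <- 400
set.seed(1)
sex <- factor(sample(x = c("f", "m"), size = n, replace = TRUE))
age <- round(runif(n = n, min = 18, max = 65))
y <- 1 + 0.8*age + 0.4*(sex == "m") - 0.7*age*(sex == "m") + rnorm(n, mean = 0, sd = 8)
dat <- data.frame(y, age, sex)

mod <- lm(y ~ age * sex, data = dat)

# Weights that pick out age + age:sexm from
# (Intercept), age, sexm, age:sexm
a <- c(0, 1, 0, 1)

est <- sum(a * coef(mod))                     # male age slope
se  <- sqrt(drop(t(a) %*% vcov(mod) %*% a))   # SE of the linear combination

# Normal-based 95% confidence interval
c(estimate = est, se = se,
  lower = est - qnorm(0.975) * se,
  upper = est + qnorm(0.975) * se)
```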

Another way to get the estimated age slopes for each sex, along with standard errors and confidence intervals, is to use the `margins` package. We use the `margins` function with our model object and specify that we want to estimate the marginal effect of `age` at each level of `sex`. (“marginal effect of `age`” is another way of saying the effect of age at each level of `sex`.)

```
library(margins)
margins(mod, variables = "age", at = list(sex = c("f", "m")))

## Average marginal effects at specified values
## lm(formula = y ~ age * sex, data = dat)
##
##  at(sex)     age
##        f 0.79781
##        m 0.07079
```

This does the formula work we did above. It plugs in `sex` and returns the estimated slope coefficient for `age`. If we wrap the call in `summary` we get the standard errors and confidence intervals.

```
summary(margins(mod, variables = "age", at = list(sex = c("f", "m"))))

##  factor    sex    AME     SE       z      p   lower  upper
##     age 1.0000 0.7978 0.0432 18.4841 0.0000  0.7132 0.8824
##     age 2.0000 0.0708 0.0481  1.4722 0.1410 -0.0235 0.1650
```
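A quick cross-check that needs no extra packages is to refit the model with males as the reference level. The `age` coefficient is then the male slope, and its standard error comes straight from the model output. The sketch below re-creates the simulated data so it runs on its own; `sex_m` and `mod_m` are made-up names.

```r
# Re-create the simulated data from above
n <- 400
set.seed(1)
sex <- factor(sample(x = c("f", "m"), size = n, replace = TRUE))
age <- round(runif(n = n, min = 18, max = 65))
y <- 1 + 0.8*age + 0.4*(sex == "m") - 0.7*age*(sex == "m") + rnorm(n, mean = 0, sd = 8)
dat <- data.frame(y, age, sex)

# Same model, but with "m" as the reference level of sex
dat$sex_m <- relevel(dat$sex, ref = "m")
mod_m <- lm(y ~ age * sex_m, data = dat)

# The age row is now the slope for males
coef(summary(mod_m))["age", ]
confint(mod_m, parm = "age")
```

Note that `confint` bases its interval on the t distribution, so its limits can differ slightly from a z-based interval.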