We are using some data stored in the R package SPH.140.615.
library(SPH.140.615)
Imagine we have three different treatments, and we are interested whether treatment has an effect on a response. In this example, the response also depends on age.
head(dat.anc1)
## rsp trt age
## 1 101.747 A 36
## 2 96.697 A 48
## 3 96.557 A 48
## 4 98.311 A 42
## 5 104.293 A 30
## 6 98.865 A 43
summary(dat.anc1)
## rsp trt age
## Min. : 95.90 A:12 Min. :27.00
## 1st Qu.: 98.37 B:12 1st Qu.:37.00
## Median :100.42 C:12 Median :42.00
## Mean :100.56 Mean :40.53
## 3rd Qu.:102.53 3rd Qu.:44.25
## Max. :107.69 Max. :49.00
Plot the response across the treatment levels.
par(las=1)
boxplot(split(dat.anc1$rsp,dat.anc1$trt), col=c("red","blue","green3"), ylab="response")
The average response in treatment C seems to be a bit higher than in the two other groups. Let’s carry out an ANOVA.
aov.fit <- aov(rsp ~ trt, data=dat.anc1)
summary(aov.fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## trt 2 44.01 22.007 2.958 0.0658 .
## Residuals 33 245.54 7.441
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The differences in means are not significant. However, age is strongly related to the response, but not to treatment.
par(las=1)
plot(dat.anc1$age, dat.anc1$rsp, pch=21, bg=c("red","blue","green3")[dat.anc1$trt], xlab="age", ylab="response")
boxplot(split(dat.anc1$age,dat.anc1$trt), col=c("red","blue","green3"), ylab="age")
Let’s look at the residuals after regressing out age.
dat.anc1$res <- residuals(lm(rsp ~ age, data=dat.anc1))
par(las=1)
boxplot(split(dat.anc1$res,dat.anc1$trt), col=c("red","blue","green3"), ylab="residuals")
The differences in means are now much more pronounced. Let’s carry out an ANOVA on the residuals.
aov.fit <- aov(res ~ trt, data=dat.anc1)
summary(aov.fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## trt 2 11.12 5.558 5.631 0.00787 **
## Residuals 33 32.58 0.987
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The differences in means are very significant. Let’s analyze the data in one wash-up using an Analysis of Covariance (ANCOVA). We fit and compare the linear models with and without the treatment.
ancova.fit <- lm(rsp ~ trt + age, data=dat.anc1)
summary(ancova.fit)
##
## Call:
## lm(formula = rsp ~ trt + age, data = dat.anc1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7766 -0.6818 0.1645 0.4679 2.1091
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 119.49624 1.41098 84.690 < 2e-16 ***
## trtB 0.67656 0.40923 1.653 0.10806
## trtC 1.42126 0.41768 3.403 0.00181 **
## age -0.48461 0.03321 -14.591 1.08e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.001 on 32 degrees of freedom
## Multiple R-squared: 0.8892, Adjusted R-squared: 0.8788
## F-statistic: 85.6 on 3 and 32 DF, p-value: 2.255e-15
ancova.fit.red <- lm(rsp ~ age, data=dat.anc1)
summary(ancova.fit.red)
##
## Call:
## lm(formula = rsp ~ age, data = dat.anc1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6097 -0.7160 -0.2374 0.8048 2.8654
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 121.14059 1.50020 80.75 < 2e-16 ***
## age -0.50793 0.03672 -13.83 1.6e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.134 on 34 degrees of freedom
## Multiple R-squared: 0.8491, Adjusted R-squared: 0.8447
## F-statistic: 191.3 on 1 and 34 DF, p-value: 1.6e-15
anova(ancova.fit.red, ancova.fit)
## Analysis of Variance Table
##
## Model 1: rsp ~ age
## Model 2: rsp ~ trt + age
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 34 43.693
## 2 32 32.082 2 11.611 5.7907 0.007138 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Summary: age explained a lot of variability in the response, so by including age in the linear model the residual variability was much smaller and we had more power to detect the treatment effect.
Pro tip: for a post-hoc comparison of the treatment levels, you can use the estimated marginal means (EMMs) and take multiple comparisons into account, for example using Tukey’s method for all pairwise comparisons.
suppressMessages(library(emmeans))
emm <- emmeans(ancova.fit,"trt")
emm
## trt emmean SE df lower.CL upper.CL
## A 99.9 0.291 32 99.3 100
## B 100.5 0.289 32 99.9 101
## C 101.3 0.293 32 100.7 102
##
## Confidence level used: 0.95
pairs(emm)
## contrast estimate SE df t.ratio p.value
## A - B -0.677 0.409 32 -1.653 0.2387
## A - C -1.421 0.418 32 -3.403 0.0050
## B - C -0.745 0.414 32 -1.798 0.1863
##
## P value adjustment: tukey method for comparing a family of 3 estimates
Same context as above.
head(dat.anc2)
## rsp trt age
## 1 105.138 A 30
## 2 104.881 A 30
## 3 107.198 A 26
## 4 100.931 A 36
## 5 102.197 A 34
## 6 104.886 A 28
summary(dat.anc2)
## rsp trt age
## Min. : 87.91 A:12 Min. :26.00
## 1st Qu.: 96.31 B:12 1st Qu.:31.75
## Median :100.86 C:12 Median :37.00
## Mean : 99.64 Mean :40.42
## 3rd Qu.:103.76 3rd Qu.:50.00
## Max. :108.08 Max. :62.00
Plot the response across the treatment levels.
par(las=1)
boxplot(split(dat.anc2$rsp,dat.anc2$trt), col=c("red","blue","green3"), ylab="response")
Looks like the response is vastly different across treatment groups. Let’s carry out an ANOVA.
aov.fit <- aov(rsp ~ trt, data=dat.anc2)
summary(aov.fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## trt 2 701.1 350.6 39.97 1.53e-09 ***
## Residuals 33 289.4 8.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The differences in means are highly significant. However, age is again strongly related to the response, and in this example, age is also related to treatment.
par(las=1)
plot(dat.anc2$age, dat.anc2$rsp, pch=21 ,bg=c("red","blue","green3")[dat.anc2$trt], xlab="age" ,ylab="response")
boxplot(split(dat.anc2$age,dat.anc2$trt), col=c("red","blue","green3"), ylab="age")
Let’s look at the residuals after regressing out age.
dat.anc2$res <- residuals(lm(rsp ~ age, data=dat.anc2))
boxplot(split(dat.anc2$res,dat.anc2$trt), col=c("red","blue","green3"), ylab="residuals")
The means in the treatment groups now look about the same. Let’s carry out an ANOVA on the residuals.
aov.fit <- aov(res ~ trt, data=dat.anc2)
summary(aov.fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## trt 2 0.97 0.4858 0.475 0.626
## Residuals 33 33.78 1.0238
The differences in means are not significant. Let’s analyze the data in one wash-up using an ANCOVA.
ancova.fit <- lm(rsp ~ trt + age, data=dat.anc2)
summary(ancova.fit)
##
## Call:
## lm(formula = rsp ~ trt + age, data = dat.anc2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3195 -0.8468 -0.1598 0.7004 2.5235
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 120.25394 1.04030 115.596 <2e-16 ***
## trtB -0.38867 0.49645 -0.783 0.439
## trtC -0.09653 0.80367 -0.120 0.905
## age -0.50598 0.03251 -15.562 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.027 on 32 degrees of freedom
## Multiple R-squared: 0.9659, Adjusted R-squared: 0.9627
## F-statistic: 302.1 on 3 and 32 DF, p-value: < 2.2e-16
ancova.fit.red <- lm(rsp ~ age, data=dat.anc2)
summary(ancova.fit.red)
##
## Call:
## lm(formula = rsp ~ age, data = dat.anc2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1797 -0.7911 -0.1908 0.7044 2.6066
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 120.1548 0.6917 173.72 <2e-16 ***
## age -0.5075 0.0166 -30.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.011 on 34 degrees of freedom
## Multiple R-squared: 0.9649, Adjusted R-squared: 0.9639
## F-statistic: 935 on 1 and 34 DF, p-value: < 2.2e-16
anova(ancova.fit.red, ancova.fit)
## Analysis of Variance Table
##
## Model 1: rsp ~ age
## Model 2: rsp ~ trt + age
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 34 34.757
## 2 32 33.783 2 0.97404 0.4613 0.6346
Summary: age explained a lot of variability in the response, but also differed across treatment groups. When we include age in the linear model, treatment is not significant. The differences in the mean responses across the treatment groups we observed initially was solely due to age being associated with both treatment and response (e.g., age is a confounder).
In linear models over-fitting does not cause bias in your predictor of interest (the standard error can be slightly inflated though), but under-fitting can cause bias. So in general, it is better to over-fit than to under-fit, especially when you have a lot of data.
The data.
flies <- data.frame(rsp = c( 9.6, 9.3, 9.3,
10.6, 9.1, 9.2,
9.8, 9.3, 9.5,
10.7, 9.1,10.0,
11.1,11.1,10.4,
10.9,11.8,10.8,
12.8,10.6,10.7),
strain = factor(rep(c("OL","BELL","bwb"),7)),
density = rep(c(60,80,160,320,640,1280,2560),rep(3,7)))
Plot the data.
par(las=1)
plot(rsp~density, data=flies, type="n", xlab="density", ylab="development days", xaxt="n", log="x")
lines(rsp~density, data=subset(flies,strain=="BELL"), type="b", col="blue")
lines(rsp~density, data=subset(flies,strain=="bwb"), type="b",col="red")
lines(rsp~density, data=subset(flies,strain=="OL"), type="b",col="green3")
axis(1,c(60,80,160,320,640,1280,2560))
legend("topleft", c("OL","bwb","BELL"), lwd=2, col=c("green3","red","blue"), bty="n")
Fit the linear model with ‘strain’ as a factor and log2 ‘density’ as a numeric predictor and plot the least squares regression lines.
ancova.fit <- lm(rsp ~ strain + log2(density), data=flies)
summary(ancova.fit)
##
## Call:
## lm(formula = rsp ~ strain + log2(density), data = flies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.90985 -0.31482 -0.04773 0.24417 1.00009
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.72243 0.56995 11.795 1.31e-09 ***
## strainbwb -0.05714 0.29139 -0.196 0.8469
## strainOL 0.74286 0.29139 2.549 0.0207 *
## log2(density) 0.39503 0.06322 6.248 8.83e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5451 on 17 degrees of freedom
## Multiple R-squared: 0.7402, Adjusted R-squared: 0.6943
## F-statistic: 16.14 on 3 and 17 DF, p-value: 3.184e-05
params <- summary(ancova.fit)$coef[,1]
par(mfrow=c(2,2), las=1, mar=c(5,4,1,2))
plot(flies$density, flies$rsp, type="n", log="x", xlab="density", ylab="development days", xaxt="n")
abline(params[1],2^params[4], lwd=2, col="blue")
points(rsp~density, data=subset(flies,strain=="BELL"), col="blue")
axis(1,c(60,80,160,320,640,1280,2560))
plot(flies$density, flies$rsp, type="n", log="x", xlab="density", ylab="development days", xaxt="n")
abline(params[1]+params[2],2^params[4], lwd=2, col="red")
points(rsp~density, data=subset(flies,strain=="bwb"), col="red")
axis(1,c(60,80,160,320,640,1280,2560))
plot(flies$density, flies$rsp, type="n", log="x", xlab="density", ylab="development days", xaxt="n")
abline(params[1]+params[3],2^params[4], lwd=2, col="green3")
points(rsp~density, data=subset(flies,strain=="OL"), col="green3")
axis(1,c(60,80,160,320,640,1280,2560))
plot(flies$density, flies$rsp, type="n", log="x", xlab="density", ylab="development days", xaxt="n")
points(rsp~density, data=subset(flies,strain=="BELL"), col="blue")
points(rsp~density, data=subset(flies,strain=="bwb"), col="red")
points(rsp~density, data=subset(flies,strain=="OL"), col="green3")
abline(params[1],2^params[4], lwd=2, col="blue")
abline(params[1]+params[2],2^params[4], lwd=2, col="red")
abline(params[1]+params[3],2^params[4], lwd=2, col="green3")
axis(1,c(60,80,160,320,640,1280,2560))
legend("topleft", c("OL","bwb","BELL"), lwd=2, col=c("green3","red","blue"), bty="n")