Statistics for Laboratory Scientists ( 140.615 )

Multiple Linear Regression - Advanced

We are using some data stored in the R package SPH.140.615.

library(SPH.140.615)


Example: detecting a treatment effect

Imagine we have three different treatments, and we are interested whether treatment has an effect on a response. In this example, the response also depends on age.

head(dat.anc1)
##       rsp trt age
## 1 101.747   A  36
## 2  96.697   A  48
## 3  96.557   A  48
## 4  98.311   A  42
## 5 104.293   A  30
## 6  98.865   A  43
summary(dat.anc1)
##       rsp         trt         age       
##  Min.   : 95.90   A:12   Min.   :27.00  
##  1st Qu.: 98.37   B:12   1st Qu.:37.00  
##  Median :100.42   C:12   Median :42.00  
##  Mean   :100.56          Mean   :40.53  
##  3rd Qu.:102.53          3rd Qu.:44.25  
##  Max.   :107.69          Max.   :49.00

Plot the response across the treatment levels.

par(las=1)
boxplot(split(dat.anc1$rsp,dat.anc1$trt), col=c("red","blue","green3"), ylab="response")

The average response in treatment C seems to be a bit higher than in the two other groups. Let’s carry out an ANOVA.

aov.fit <- aov(rsp ~ trt, data=dat.anc1)
summary(aov.fit)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## trt          2  44.01  22.007   2.958 0.0658 .
## Residuals   33 245.54   7.441                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The differences in means are not significant. However, age is strongly related to the response, but not to treatment.

par(las=1)
plot(dat.anc1$age, dat.anc1$rsp, pch=21, bg=c("red","blue","green3")[dat.anc1$trt], xlab="age", ylab="response")

boxplot(split(dat.anc1$age,dat.anc1$trt), col=c("red","blue","green3"), ylab="age")

Let’s look at the residuals after regressing out age.

dat.anc1$res <- residuals(lm(rsp ~ age, data=dat.anc1))
par(las=1)
boxplot(split(dat.anc1$res,dat.anc1$trt), col=c("red","blue","green3"), ylab="residuals")

The differences in means are now much more pronounced. Let’s carry out an ANOVA on the residuals.

aov.fit <- aov(res ~ trt, data=dat.anc1)
summary(aov.fit)
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## trt          2  11.12   5.558   5.631 0.00787 **
## Residuals   33  32.58   0.987                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The differences in means are very significant. Let’s analyze the data in one wash-up using an Analysis of Covariance (ANCOVA). We fit and compare the linear models with and without the treatment.

ancova.fit <- lm(rsp ~ trt + age, data=dat.anc1)
summary(ancova.fit)
## 
## Call:
## lm(formula = rsp ~ trt + age, data = dat.anc1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7766 -0.6818  0.1645  0.4679  2.1091 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 119.49624    1.41098  84.690  < 2e-16 ***
## trtB          0.67656    0.40923   1.653  0.10806    
## trtC          1.42126    0.41768   3.403  0.00181 ** 
## age          -0.48461    0.03321 -14.591 1.08e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.001 on 32 degrees of freedom
## Multiple R-squared:  0.8892, Adjusted R-squared:  0.8788 
## F-statistic:  85.6 on 3 and 32 DF,  p-value: 2.255e-15
ancova.fit.red <- lm(rsp ~ age, data=dat.anc1)
summary(ancova.fit.red)
## 
## Call:
## lm(formula = rsp ~ age, data = dat.anc1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6097 -0.7160 -0.2374  0.8048  2.8654 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 121.14059    1.50020   80.75  < 2e-16 ***
## age          -0.50793    0.03672  -13.83  1.6e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.134 on 34 degrees of freedom
## Multiple R-squared:  0.8491, Adjusted R-squared:  0.8447 
## F-statistic: 191.3 on 1 and 34 DF,  p-value: 1.6e-15
anova(ancova.fit.red, ancova.fit)
## Analysis of Variance Table
## 
## Model 1: rsp ~ age
## Model 2: rsp ~ trt + age
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1     34 43.693                                
## 2     32 32.082  2    11.611 5.7907 0.007138 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Summary: age explained a lot of variability in the response, so by including age in the linear model the residual variability was much smaller and we had more power to detect the treatment effect.

Pro tip: for a post-hoc comparison of the treatment levels, you can use the estimated marginal means (EMMs) and take multiple comparisons into account, for example using Tukey’s method for all pairwise comparisons.

suppressMessages(library(emmeans))
emm <- emmeans(ancova.fit,"trt")
emm
##  trt emmean    SE df lower.CL upper.CL
##  A     99.9 0.291 32     99.3      100
##  B    100.5 0.289 32     99.9      101
##  C    101.3 0.293 32    100.7      102
## 
## Confidence level used: 0.95
pairs(emm)
##  contrast estimate    SE df t.ratio p.value
##  A - B      -0.677 0.409 32  -1.653  0.2387
##  A - C      -1.421 0.418 32  -3.403  0.0050
##  B - C      -0.745 0.414 32  -1.798  0.1863
## 
## P value adjustment: tukey method for comparing a family of 3 estimates


Example: confounding

Same context as above.

head(dat.anc2)
##       rsp trt age
## 1 105.138   A  30
## 2 104.881   A  30
## 3 107.198   A  26
## 4 100.931   A  36
## 5 102.197   A  34
## 6 104.886   A  28
summary(dat.anc2)
##       rsp         trt         age       
##  Min.   : 87.91   A:12   Min.   :26.00  
##  1st Qu.: 96.31   B:12   1st Qu.:31.75  
##  Median :100.86   C:12   Median :37.00  
##  Mean   : 99.64          Mean   :40.42  
##  3rd Qu.:103.76          3rd Qu.:50.00  
##  Max.   :108.08          Max.   :62.00

Plot the response across the treatment levels.

par(las=1)
boxplot(split(dat.anc2$rsp,dat.anc2$trt), col=c("red","blue","green3"), ylab="response")

Looks like the response is vastly different across treatment groups. Let’s carry out an ANOVA.

aov.fit <- aov(rsp ~ trt, data=dat.anc2)
summary(aov.fit)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## trt          2  701.1   350.6   39.97 1.53e-09 ***
## Residuals   33  289.4     8.8                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The differences in means are highly significant. However, age is again strongly related to the response, and in this example, age is also related to treatment.

par(las=1)
plot(dat.anc2$age, dat.anc2$rsp, pch=21 ,bg=c("red","blue","green3")[dat.anc2$trt], xlab="age" ,ylab="response")

boxplot(split(dat.anc2$age,dat.anc2$trt), col=c("red","blue","green3"), ylab="age")

Let’s look at the residuals after regressing out age.

dat.anc2$res <- residuals(lm(rsp ~ age, data=dat.anc2))
boxplot(split(dat.anc2$res,dat.anc2$trt), col=c("red","blue","green3"), ylab="residuals")

The means in the treatment groups now look about the same. Let’s carry out an ANOVA on the residuals.

aov.fit <- aov(res ~ trt, data=dat.anc2)
summary(aov.fit)
##             Df Sum Sq Mean Sq F value Pr(>F)
## trt          2   0.97  0.4858   0.475  0.626
## Residuals   33  33.78  1.0238

The differences in means are not significant. Let’s analyze the data in one wash-up using an ANCOVA.

ancova.fit <- lm(rsp ~ trt + age, data=dat.anc2)
summary(ancova.fit)
## 
## Call:
## lm(formula = rsp ~ trt + age, data = dat.anc2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3195 -0.8468 -0.1598  0.7004  2.5235 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 120.25394    1.04030 115.596   <2e-16 ***
## trtB         -0.38867    0.49645  -0.783    0.439    
## trtC         -0.09653    0.80367  -0.120    0.905    
## age          -0.50598    0.03251 -15.562   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.027 on 32 degrees of freedom
## Multiple R-squared:  0.9659, Adjusted R-squared:  0.9627 
## F-statistic: 302.1 on 3 and 32 DF,  p-value: < 2.2e-16
ancova.fit.red <- lm(rsp ~ age, data=dat.anc2)
summary(ancova.fit.red)
## 
## Call:
## lm(formula = rsp ~ age, data = dat.anc2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1797 -0.7911 -0.1908  0.7044  2.6066 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 120.1548     0.6917  173.72   <2e-16 ***
## age          -0.5075     0.0166  -30.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.011 on 34 degrees of freedom
## Multiple R-squared:  0.9649, Adjusted R-squared:  0.9639 
## F-statistic:   935 on 1 and 34 DF,  p-value: < 2.2e-16
anova(ancova.fit.red, ancova.fit)
## Analysis of Variance Table
## 
## Model 1: rsp ~ age
## Model 2: rsp ~ trt + age
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     34 34.757                           
## 2     32 33.783  2   0.97404 0.4613 0.6346

Summary: age explained a lot of variability in the response, but also differed across treatment groups. When we include age in the linear model, treatment is not significant. The differences in the mean responses across the treatment groups we observed initially was solely due to age being associated with both treatment and response (e.g., age is a confounder).

In linear models over-fitting does not cause bias in your predictor of interest (the standard error can be slightly inflated though), but under-fitting can cause bias. So in general, it is better to over-fit than to under-fit, especially when you have a lot of data.


Example from class: flies, density, strain, no replicates

The data.

flies <- data.frame(rsp = c( 9.6, 9.3, 9.3,
                            10.6, 9.1, 9.2,
                             9.8, 9.3, 9.5,
                            10.7, 9.1,10.0,
                            11.1,11.1,10.4,
                            10.9,11.8,10.8,
                            12.8,10.6,10.7),
                   strain = factor(rep(c("OL","BELL","bwb"),7)),
                   density = rep(c(60,80,160,320,640,1280,2560),rep(3,7)))

Plot the data.

par(las=1)
plot(rsp~density, data=flies, type="n", xlab="density", ylab="development days", xaxt="n", log="x")
lines(rsp~density, data=subset(flies,strain=="BELL"), type="b", col="blue")
lines(rsp~density, data=subset(flies,strain=="bwb"), type="b",col="red")
lines(rsp~density, data=subset(flies,strain=="OL"), type="b",col="green3")
axis(1,c(60,80,160,320,640,1280,2560))
legend("topleft", c("OL","bwb","BELL"), lwd=2, col=c("green3","red","blue"), bty="n")

Fit the linear model with ‘strain’ as a factor and log2 ‘density’ as a numeric predictor and plot the least squares regression lines.

ancova.fit <- lm(rsp ~ strain + log2(density), data=flies) 
summary(ancova.fit)
## 
## Call:
## lm(formula = rsp ~ strain + log2(density), data = flies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.90985 -0.31482 -0.04773  0.24417  1.00009 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.72243    0.56995  11.795 1.31e-09 ***
## strainbwb     -0.05714    0.29139  -0.196   0.8469    
## strainOL       0.74286    0.29139   2.549   0.0207 *  
## log2(density)  0.39503    0.06322   6.248 8.83e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5451 on 17 degrees of freedom
## Multiple R-squared:  0.7402, Adjusted R-squared:  0.6943 
## F-statistic: 16.14 on 3 and 17 DF,  p-value: 3.184e-05
params <- summary(ancova.fit)$coef[,1]
par(mfrow=c(2,2), las=1, mar=c(5,4,1,2))
plot(flies$density, flies$rsp, type="n", log="x", xlab="density", ylab="development days", xaxt="n")
abline(params[1],2^params[4], lwd=2, col="blue")
points(rsp~density, data=subset(flies,strain=="BELL"), col="blue")
axis(1,c(60,80,160,320,640,1280,2560))
plot(flies$density, flies$rsp, type="n", log="x", xlab="density", ylab="development days", xaxt="n")
abline(params[1]+params[2],2^params[4], lwd=2, col="red")
points(rsp~density, data=subset(flies,strain=="bwb"), col="red")
axis(1,c(60,80,160,320,640,1280,2560))
plot(flies$density, flies$rsp, type="n", log="x", xlab="density", ylab="development days", xaxt="n")
abline(params[1]+params[3],2^params[4], lwd=2, col="green3")
points(rsp~density, data=subset(flies,strain=="OL"), col="green3")
axis(1,c(60,80,160,320,640,1280,2560))
plot(flies$density, flies$rsp, type="n", log="x", xlab="density", ylab="development days", xaxt="n")
points(rsp~density, data=subset(flies,strain=="BELL"), col="blue")
points(rsp~density, data=subset(flies,strain=="bwb"), col="red")
points(rsp~density, data=subset(flies,strain=="OL"), col="green3")
abline(params[1],2^params[4], lwd=2, col="blue")
abline(params[1]+params[2],2^params[4], lwd=2, col="red")
abline(params[1]+params[3],2^params[4], lwd=2, col="green3")
axis(1,c(60,80,160,320,640,1280,2560))
legend("topleft", c("OL","bwb","BELL"), lwd=2, col=c("green3","red","blue"), bty="n")