Regression Adjustments

Study: Meadowfoam

Meadowfoam is a flowering plant native to California.

Can we affect the number of flowers a plant produces by providing an extra light treatment, varying both its intensity and its timing?

A model for a categorical and continuous effect (ANCOVA):

\[ y_{ij} = \mu + \alpha_i + \beta_1 x_{ij} + \epsilon_{ij} \]

Each group has its own linear model with the same slope on \(x\) but a different intercept coming from the categorical effect (two intercepts, one slope).

\[\begin{align} y_{i1} &= \mu + \alpha_1 + \beta_1 x_{i1} + \epsilon_{i1} = \mu_1 + \beta_1 x_{i1} + \epsilon_{i1} \\ y_{i2} &= \mu + \alpha_2 + \beta_1 x_{i2} + \epsilon_{i2} = \mu_2 + \beta_1 x_{i2} + \epsilon_{i2} \end{align}\]
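
A minimal sketch of fitting this model in R with simulated data (all names here are made up, not the meadowfoam variables): `group` contributes the two intercepts and `x` has one common slope.

```r
# Sketch: ANCOVA with simulated data.
# `group` contributes two intercepts; `x` has one common slope.
set.seed(42)
group <- rep(c("a", "b"), each = 20)
x     <- runif(40)
y     <- 5 + ifelse(group == "a", 2, -2) + 3 * x + rnorm(40, sd = 0.5)

m <- lm(y ~ group + x)   # parallel lines: different intercepts, same slope
coef(m)
```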

Meadowfoam

meadowfoam
# A tibble: 24 × 3
   flowers time  intensity
     <dbl> <chr>     <dbl>
 1    62.3 late        150
 2    77.4 late        150
 3    55.3 late        300
 4    54.2 late        300
 5    49.6 late        450
 6    61.9 late        450
 7    39.4 late        600
 8    45.7 late        600
 9    31.3 late        750
10    44.9 late        750
# ℹ 14 more rows

Suppose only time was randomly assigned, while intensity was a recorded covariate that we know is a good predictor of flowers.

What we’ve been doing

Since time is the only experimental factor, we would fit this model:

options(contrasts = c("contr.sum", "contr.poly"))
m1 <- lm(flowers ~ time, data = meadowfoam)
summary(m1)$coef
             Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 56.137500   2.556552 21.958286 1.874990e-16
time1        6.079166   2.556552  2.377877 2.652621e-02
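
With sum-to-zero contrasts, the intercept estimates the grand mean \(\mu\) and time1 estimates \(\alpha_1\), so the two group means can be recovered by hand:

```r
# Recovering the two group means from the sum-contrast fit above:
# mu_1 = mu + alpha_1, mu_2 = mu - alpha_1 (since alpha_1 + alpha_2 = 0).
mu     <- 56.137500   # intercept from summary(m1)
alpha1 <- 6.079166    # time1 from summary(m1)
mu + alpha1   # mean flowers for the first level of time, ~62.2
mu - alpha1   # mean flowers for the second level, ~50.1
```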

Adjust for intensity

m_full <- lm(flowers ~ intensity + time, data = meadowfoam)
m_no_time <- lm(flowers ~ intensity, data = meadowfoam)
summary(m_full)$coef
               Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 77.38500017 2.99815637 25.810862 2.166375e-17
intensity   -0.04047143 0.00513237 -7.885525 1.036787e-07
time1        6.07916641 1.31477854  4.623719 1.463777e-04
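
Note that the time1 estimate is identical to the unadjusted fit: the treatments were assigned in a balanced way, so time and intensity are orthogonal. What changes is the precision:

```r
# Ratio of the time1 standard errors, unadjusted (m1) vs adjusted (m_full):
2.556552 / 1.314779   # adjusting for intensity shrinks the SE ~1.9-fold
```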

summary(m_no_time)$coef
               Estimate  Std. Error   t value     Pr(>|t|)
(Intercept) 77.38500017 4.161186173 18.596861 6.059011e-15
intensity   -0.04047143 0.007123293 -5.681562 1.029503e-05

anova(m_no_time, m_full)
Analysis of Variance Table

Model 1: flowers ~ intensity
Model 2: flowers ~ intensity + time
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1     22 1758.19                                  
2     21  871.24  1    886.95 21.379 0.0001464 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
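
Since time adds only one degree of freedom, this partial F test is equivalent to the t test on time1 in the full model:

```r
# A one-df partial F test equals the squared t test: F = t^2.
t_time <- 4.623719   # t value for time1 in summary(m_full)
t_time^2             # ~21.379, matching the F statistic above
```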

Cautions

When fitting regression models to experimental data:

  1. Coefficients on non-experimental variables have no causal interpretation.

  2. Regression adjustments generally should not be used in very small datasets (\(N < 20\)), because degrees of freedom are better spent estimating the causal effects.

Collinearity

Simulated data with two explanatory variables.

# A tibble: 100 × 3
        y     x1     x2
    <dbl>  <dbl>  <dbl>
 1  6.69  -0.986 -0.532
 2 -0.837  0.372 -0.431
 3  1.79  -0.232 -0.586
 4  3.95  -1.22   1.10 
 5  0.662  0.266 -0.653
 6  1.43  -0.602  0.382
 7  1.77   0.280 -1.98 
 8 -7.97   1.87  -0.536
 9  1.11   0.441 -0.363
10  5.24  -0.942 -0.367
# ℹ 90 more rows

pairs(weak_collinearity)

cor(weak_collinearity)
            y         x1         x2
y   1.0000000 -0.9075633 -0.4952396
x1 -0.9075633  1.0000000  0.1272063
x2 -0.4952396  0.1272063  1.0000000

Collinearity: the presence of linear relationships among the explanatory variables.

Results from Linear Models

\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} \]

\[ Var(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1} \]

\[ Var(\hat{\beta}_j) = \frac{\sigma^2}{(n-1) Var(X_j)} \cdot \frac{1}{1 - R^2_j} \]

where \(R^2_j\) is the \(R^2\) from regressing \(X_j\) on other \(X\). The second term is called the Variance Inflation Factor (VIF).
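
A sketch of computing \(R^2_j\) and the VIF directly from this definition, using simulated predictors (the variable names are illustrative):

```r
# VIF for x1: regress x1 on the remaining predictors, then 1 / (1 - R^2).
set.seed(1)
x1 <- rnorm(100)
x2 <- 0.8 * x1 + rnorm(100, sd = 0.6)   # x2 moderately collinear with x1

r2_1  <- summary(lm(x1 ~ x2))$r.squared
vif_1 <- 1 / (1 - r2_1)
vif_1   # > 1 whenever x1 is linearly related to the other predictors
```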

Weak Collinearity

m1 <- lm(y ~ x1 + x2, data = weak_collinearity)
summary(m1)$coef
              Estimate Std. Error    t value     Pr(>|t|)
(Intercept) -0.1166074 0.09916054  -1.175945 2.424949e-01
x1          -4.9560576 0.10187288 -48.649429 6.050966e-70
x2          -2.0795123 0.09505426 -21.877109 2.747300e-39

This linear model corresponds to a plane in 3D.

Strong Collinearity

# A tibble: 100 × 3
        y     x1     x2
    <dbl>  <dbl>  <dbl>
 1   5.62 -0.950 -0.792
 2  -1.55  0.149  0.349
 3 -13.4   1.67   1.62 
 4 -10.4   1.66   1.54 
 5  -4.01  0.667  0.743
 6   7.63 -0.994 -0.983
 7  -3.27  0.341  0.232
 8   6.35 -0.905 -0.720
 9 -14.2   1.87   1.82 
10   5.55 -1.08  -1.09 
# ℹ 90 more rows

pairs(strong_collinearity)

cor(strong_collinearity)
            y         x1         x2
y   1.0000000 -0.9885235 -0.9844989
x1 -0.9885235  1.0000000  0.9908118
x2 -0.9844989  0.9908118  1.0000000

High collinearity leads to high VIF.
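
With only two predictors, \(R^2_j\) is just the squared correlation between them, so the VIF can be read directly off the correlation matrices above:

```r
# VIF = 1 / (1 - r^2) when there are exactly two predictors.
1 / (1 - 0.1272063^2)   # weak collinearity:   VIF ~ 1.02
1 / (1 - 0.9908118^2)   # strong collinearity: VIF ~ 55
```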

Designed Experiment

# A tibble: 100 × 3
        y    x1    x2
    <dbl> <dbl> <dbl>
 1  19.4     -3    -3
 2  11.2     -1    -3
 3   5.33     0    -3
 4   2.22     1    -3
 5 -10.6      3    -3
 6  18.3     -3    -1
 7   7.20    -1    -1
 8   3.13     0    -1
 9  -4.06     1    -1
10 -13.2      3    -1
# ℹ 90 more rows

pairs(designed_exp)

cor(designed_exp)
            y        x1         x2
y   1.0000000 -0.921607 -0.3752132
x1 -0.9216070  1.000000  0.0000000
x2 -0.3752132  0.000000  1.0000000

No collinearity leads to \(R^2_j = 0\) and VIF \(= 1\): no variance inflation at all.
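
In an orthogonal design the VIF formula gives exactly 1. A sketch checking this, assuming the data above come from a balanced factorial over the evenly replicated levels shown:

```r
# In a balanced factorial design the predictors are exactly orthogonal,
# so each R^2_j is 0 and every VIF is 1.
grid <- expand.grid(x1 = c(-3, -1, 0, 1, 3), x2 = c(-3, -1, 0, 1, 3))
cor(grid$x1, grid$x2)               # 0 by construction
1 / (1 - cor(grid$x1, grid$x2)^2)   # VIF = 1
```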