Meadowfoam is a flowering California native plant.
Can we affect the number of flowers produced by plants by providing an extra light treatment, which varies in its intensity as well as its timing?

A model for a categorical and continuous effect (ANCOVA):
\[ y_{ij} = \mu + \alpha_i + \beta_1 x_{ij} + \epsilon_{ij} \]
Each group has a linear model with the same slope on \(x\) but a different intercept from the categorical effect (two intercepts, one common slope).
\[\begin{align} y_{i1} &= \mu + \alpha_1 + \beta_1 x_{i1} + \epsilon_{i1} = \mu_1 + \beta_1 x_{i1} + \epsilon_{i1} \\ y_{i2} &= \mu + \alpha_2 + \beta_1 x_{i2} + \epsilon_{i2} = \mu_2 + \beta_1 x_{i2} + \epsilon_{i2} \end{align}\]
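This two-intercept, common-slope structure can be made concrete with a small design matrix. A minimal NumPy sketch on made-up numbers (the notes use R; all names and values here are illustrative):

```python
import numpy as np

# Illustrative data: three observations per group
x = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])   # continuous covariate x_ij
g = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])   # indicator for group 2
y = np.array([5.1, 6.9, 9.2, 7.0, 9.1, 10.8])  # response y_ij

# Columns: common intercept mu_1, group-2 shift (mu_2 - mu_1), common slope beta_1
X = np.column_stack([np.ones_like(x), g, x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
mu1, shift, slope = beta_hat
# Group intercepts are mu_1 and mu_2 = mu_1 + shift; both groups share `slope`.
```

The indicator column is what turns one categorical factor into a second intercept while leaving a single slope for \(x\).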
# A tibble: 24 × 3
flowers time intensity
<dbl> <chr> <dbl>
1 62.3 late 150
2 77.4 late 150
3 55.3 late 300
4 54.2 late 300
5 49.6 late 450
6 61.9 late 450
7 39.4 late 600
8 45.7 late 600
9 31.3 late 750
10 44.9 late 750
# ℹ 14 more rows
Let’s say that only time was randomly assigned, and intensity was another recorded variable that we know is a good predictor of flowers.
Since time is the only experimental factor, we would fit this model, with intensity included as a covariate adjustment:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 77.38500017 2.99815637 25.810862 2.166375e-17
intensity -0.04047143 0.00513237 -7.885525 1.036787e-07
time1 6.07916641 1.31477854 4.623719 1.463777e-04
For comparison, the same data without the time term (flowers ~ intensity only):
Estimate Std. Error t value Pr(>|t|)
(Intercept) 77.38500017 4.161186173 18.596861 6.059011e-15
intensity -0.04047143 0.007123293 -5.681562 1.029503e-05
Analysis of Variance Table
Model 1: flowers ~ intensity
Model 2: flowers ~ intensity + time
Res.Df RSS Df Sum of Sq F Pr(>F)
1 22 1758.19
2 21 871.24 1 886.95 21.379 0.0001464 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
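The F statistic in this table is just the extra sum of squares per degree of freedom, scaled by the full model's residual mean square. A quick arithmetic check, with the numbers copied from the ANOVA table above (Python used only as a calculator):

```python
# Residual SS and df from the two fitted models above
rss_reduced, df_reduced = 1758.19, 22   # flowers ~ intensity
rss_full, df_full = 871.24, 21          # flowers ~ intensity + time

extra_ss = rss_reduced - rss_full       # 886.95, on 1 df
f_stat = (extra_ss / (df_reduced - df_full)) / (rss_full / df_full)
print(round(f_stat, 3))                 # 21.379, matching the anova() output
```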
When fitting regression models to experimental data:
Non-experimental variables have no causal interpretation.
Regression adjustments generally should not be used in very small datasets (\(N < 20\)) because the degrees of freedom are better spent estimating causal effects.
Simulated data with two explanatory variables.
# A tibble: 100 × 3
y x1 x2
<dbl> <dbl> <dbl>
1 6.69 -0.986 -0.532
2 -0.837 0.372 -0.431
3 1.79 -0.232 -0.586
4 3.95 -1.22 1.10
5 0.662 0.266 -0.653
6 1.43 -0.602 0.382
7 1.77 0.280 -1.98
8 -7.97 1.87 -0.536
9 1.11 0.441 -0.363
10 5.24 -0.942 -0.367
# ℹ 90 more rows

y x1 x2
y 1.0000000 -0.9075633 -0.4952396
x1 -0.9075633 1.0000000 0.1272063
x2 -0.4952396 0.1272063 1.0000000
Collinearity describes the presence of linear relationships among the explanatory variables.
\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} \]
\[ Var(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1} \]
\[ Var(\hat{\beta}_j) = \frac{\sigma^2}{(n-1) Var(X_j)} \cdot \frac{1}{1 - R^2_j} \]
where \(R^2_j\) is the \(R^2\) from regressing \(X_j\) on the other explanatory variables. The second factor is called the Variance Inflation Factor (VIF).
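The two expressions for \(Var(\hat{\beta}_j)\) agree: the \(j\)-th diagonal entry of \((\mathbf{X}^\top \mathbf{X})^{-1}\) equals \(1 / \big((n-1)\,Var(X_j)\,(1 - R^2_j)\big)\). A NumPy check on simulated correlated predictors (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # x2 nearly collinear with x1

X = np.column_stack([np.ones(n), x1, x2])
xtx_inv = np.linalg.inv(X.T @ X)

# Route 1: Var(beta_hat_1) / sigma^2 straight from the matrix formula
v_direct = xtx_inv[1, 1]

# Route 2: the VIF formula; with one other predictor, R^2_1 is just cor(x1, x2)^2
r2 = np.corrcoef(x1, x2)[0, 1] ** 2
vif = 1 / (1 - r2)
v_formula = vif / ((n - 1) * np.var(x1, ddof=1))

print(np.isclose(v_direct, v_formula))     # the two routes match
```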
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.1166074 0.09916054 -1.175945 2.424949e-01
x1 -4.9560576 0.10187288 -48.649429 6.050966e-70
x2 -2.0795123 0.09505426 -21.877109 2.747300e-39
This linear model corresponds to a plane in 3D.
# A tibble: 100 × 3
y x1 x2
<dbl> <dbl> <dbl>
1 5.62 -0.950 -0.792
2 -1.55 0.149 0.349
3 -13.4 1.67 1.62
4 -10.4 1.66 1.54
5 -4.01 0.667 0.743
6 7.63 -0.994 -0.983
7 -3.27 0.341 0.232
8 6.35 -0.905 -0.720
9 -14.2 1.87 1.82
10 5.55 -1.08 -1.09
# ℹ 90 more rows

y x1 x2
y 1.0000000 -0.9885235 -0.9844989
x1 -0.9885235 1.0000000 0.9908118
x2 -0.9844989 0.9908118 1.0000000
High collinearity leads to high VIF.
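With one other predictor, \(R^2_j\) is the squared correlation, so the VIF can be read straight off the correlation matrix above; plugging in the 0.9908 between x1 and x2:

```python
r = 0.9908               # cor(x1, x2) from the correlation matrix above
vif = 1 / (1 - r ** 2)
print(round(vif, 1))     # about 54.6: standard errors inflated ~7-fold (sqrt of VIF)
```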
# A tibble: 100 × 3
y x1 x2
<dbl> <dbl> <dbl>
1 19.4 -3 -3
2 11.2 -1 -3
3 5.33 0 -3
4 2.22 1 -3
5 -10.6 3 -3
6 18.3 -3 -1
7 7.20 -1 -1
8 3.13 0 -1
9 -4.06 1 -1
10 -13.2 3 -1
# ℹ 90 more rows

y x1 x2
y 1.0000000 -0.921607 -0.3752132
x1 -0.9216070 1.000000 0.0000000
x2 -0.3752132 0.000000 1.0000000
No collinearity leads to \(R^2_j = 0\) and hence VIF \(= 1\): the coefficient variances are as small as the design allows.
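For a balanced grid like the one above, the predictors are exactly orthogonal, so \(R^2_j = 0\) and the VIF is exactly 1. A quick NumPy check (the grid levels mirror the tibble above; the replication count is illustrative):

```python
import numpy as np

# Every x1 level crossed with every x2 level, as in the design above
levels = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
x1 = np.repeat(levels, 5)
x2 = np.tile(levels, 5)

r2 = np.corrcoef(x1, x2)[0, 1] ** 2   # 0: the design is orthogonal
vif = 1 / (1 - r2)
print(vif)                            # 1.0 -- no variance inflation
```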