Problem Set 5

```r
library(tidyverse)
set.seed(40)
n <- 20
coffee <- tibble(
  subject = 1:n,
  sequence = rep(c("coffee_first", "decaf_first"), each = n / 2),
  subject_ability = rnorm(n, mean = 250, sd = 30)
) |>
  mutate(
    period1 = case_when(
      sequence == "coffee_first" ~ subject_ability - 8 + rnorm(n, 0, 10),
      sequence == "decaf_first" ~ subject_ability + rnorm(n, 0, 10)
    ),
    period2 = case_when(
      sequence == "coffee_first" ~ subject_ability + 5 + rnorm(n, 0, 10),
      sequence == "decaf_first" ~ subject_ability - 8 + 5 + rnorm(n, 0, 10)
    )
  ) |>
  select(subject, sequence, period1, period2)
```
Question Format and Score: Crossover Design. Three weeks ago you took your midterm exam. Unbeknownst to you, the exam was also serving as a crossover experiment. There were four versions (A, B, C, D) where versions A/C shared one structure and versions B/D shared another. The treatment of interest was question format: whether a question was presented as a single compound sentence or as separate sentences/bullet points. Questions 21 and 23 each had a compound and a separated version; the two structures (A/C vs B/D) ensured that if Q21 was compound then Q23 was separated, and vice versa.
Exam versions were randomly assigned to students in blocks by session (morning or afternoon).
- What are the experimental units in this study? What are the measurement units?
- What is the treatment? How many levels does it have?
- Why is this a crossover design rather than a simple completely randomized design? What is the advantage of each student receiving both formats?
- What is the blocking factor, and why was blocking used?
Coffee and Reaction Time: Crossover Design. A researcher wants to know whether drinking coffee before a cognitive test improves reaction time. She recruits 20 subjects and uses a crossover design: each subject takes the cognitive test twice, once after drinking coffee and once after drinking decaf (a placebo). The order (coffee-first vs decaf-first) is randomly assigned, with a one-week washout period between sessions.
- What is the primary advantage of this crossover design compared to a completely randomized design that assigns 10 subjects to coffee and 10 to decaf?
- The researcher is concerned about a carryover effect: caffeine from the first session might still affect performance in the second session even after a week. Explain why it is a threat to this crossover design.
- If carryover effects are present and asymmetric (i.e., the carryover from coffee to decaf is different from the carryover from decaf to coffee), explain why the crossover estimator \(\bar{Z}\) (the average of the observed differences) from the previous problem would be biased.
- One way to test for carryover is to compare the total response (session 1 + session 2) between the two sequence groups. Explain the logic behind this test: under no carryover, why should the sequence groups have the same expected total?
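The logic of this test can be sketched in R. This is a minimal illustration, not the assigned solution; it regenerates a condensed version of the dataset used in the next problem (same true effects: coffee −8 ms, period 2 +5 ms, no carryover) and compares session totals across the two sequence groups:

```r
library(tidyverse)

# Condensed stand-in for the coffee crossover data.
set.seed(40)
coffee <- tibble(
  subject  = 1:20,
  sequence = rep(c("coffee_first", "decaf_first"), each = 10),
  ability  = rnorm(20, 250, 30)
) |>
  mutate(
    period1 = if_else(sequence == "coffee_first", ability - 8, ability) +
      rnorm(20, 0, 10),
    period2 = if_else(sequence == "coffee_first", ability + 5, ability - 3) +
      rnorm(20, 0, 10)
  )

# Each subject's total contains one coffee session, one decaf session,
# and both period effects -- identical in expectation across sequences
# when there is no carryover.
coffee <- coffee |> mutate(total = period1 + period2)
t.test(total ~ sequence, data = coffee)
```

Carryover would add a sequence-specific term to exactly one of the two sessions, shifting one group's expected total and making this comparison detect the imbalance.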
Coffee and Reaction Time: Analysis. The researcher from the previous problem ran her crossover experiment and collected data. You can generate a synthetic version of her dataset with the code at the top of this problem set.

Here, `period1` and `period2` are reaction times (in milliseconds) on the cognitive test in session 1 and session 2 respectively. We’ll denote these \(Y_{i1}\) and \(Y_{i2}\) for subject \(i\). Lower is better. The data-generating process includes a true treatment effect of coffee (coffee lowers reaction time by 8 ms) and a period effect (reaction time increases by 5 ms in period 2, perhaps due to fatigue or reduced novelty). There is no carryover effect.

For each subject, compute \(Z_i\), the within-subject difference in reaction time (decaf \(-\) coffee). Be careful: which period corresponds to the coffee score depends on the subject’s sequence. Add this column to the data frame and compute \(\bar{Z}\).
Conduct a randomization test of the sharp null hypothesis that coffee has no effect on any subject’s reaction time.
- State the null hypothesis in terms of potential outcomes.
- Simulate 1,000 test statistics under the null by re-randomizing the sequence assignment (a complete randomization of 10 to each sequence).
- Plot the null distribution with the observed statistic and report the two-sided p-value.
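One way the re-randomization step might look in R, under the sharp null (the dataset is regenerated here in condensed form so the snippet runs on its own; the observed responses are held fixed and only the sequence labels are permuted):

```r
library(tidyverse)

# Condensed stand-in for the coffee crossover data (true coffee effect: -8 ms).
set.seed(40)
coffee <- tibble(
  subject  = 1:20,
  sequence = rep(c("coffee_first", "decaf_first"), each = 10),
  ability  = rnorm(20, 250, 30)
) |>
  mutate(
    period1 = if_else(sequence == "coffee_first", ability - 8, ability) +
      rnorm(20, 0, 10),
    period2 = if_else(sequence == "coffee_first", ability + 5, ability - 3) +
      rnorm(20, 0, 10)
  )

# Z_i = decaf - coffee; which period is coffee depends on the sequence.
coffee <- coffee |>
  mutate(Z = if_else(sequence == "coffee_first",
                     period2 - period1,   # decaf (p2) - coffee (p1)
                     period1 - period2))  # decaf (p1) - coffee (p2)
z_bar <- mean(coffee$Z)

# Sharp null: coffee changes no subject's reaction time, so the observed
# scores are fixed and only the sequence assignment is re-randomized
# (10 subjects per sequence).
d <- coffee$period2 - coffee$period1
null_stats <- replicate(1000, {
  seq_new <- sample(coffee$sequence)
  mean(if_else(seq_new == "coffee_first", d, -d))
})
p_value <- mean(abs(null_stats) >= abs(z_bar))
c(z_bar = z_bar, p_value = p_value)
```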
Reshape the data into long format (one row per subject per period) and fit a linear model with fixed effects for subject:
\[Y_{ij} = \beta_0 + \beta_1 \text{treatment}_{ij} + \beta_2 \text{period}_{ij} + \alpha_i + \varepsilon_{ij}\]
where \(\alpha_i\) is a fixed effect for subject \(i\). Report and interpret \(\hat{\beta}_1\). Compare this with the result from the randomization test.
Now fit a simpler model that omits the period effect: \(Y_{ij} = \beta_0 + \beta_1 \text{treatment}_{ij} + \alpha_i + \varepsilon_{ij}\). How does the estimate of the treatment effect change? Explain why the crossover design protects the treatment effect estimate from period effects even when period is not included in the model.
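A sketch of the reshape and both fixed-effects fits, again on a condensed regeneration of the data. Here `treatmentdecaf` is the decaf-minus-coffee contrast, so it should land near +8 ms in both models:

```r
library(tidyverse)

# Condensed stand-in for the coffee crossover data
# (coffee effect: -8 ms; period-2 effect: +5 ms).
set.seed(40)
coffee <- tibble(
  subject  = 1:20,
  sequence = rep(c("coffee_first", "decaf_first"), each = 10),
  ability  = rnorm(20, 250, 30)
) |>
  mutate(
    period1 = if_else(sequence == "coffee_first", ability - 8, ability) +
      rnorm(20, 0, 10),
    period2 = if_else(sequence == "coffee_first", ability + 5, ability - 3) +
      rnorm(20, 0, 10)
  )

# One row per subject per period, with treatment recovered from
# sequence x period.
long <- coffee |>
  pivot_longer(c(period1, period2), names_to = "period", values_to = "Y") |>
  mutate(treatment = if_else(
    (sequence == "coffee_first" & period == "period1") |
      (sequence == "decaf_first" & period == "period2"),
    "coffee", "decaf"
  ))

fit_full   <- lm(Y ~ treatment + period + factor(subject), data = long)
fit_simple <- lm(Y ~ treatment + factor(subject), data = long)

# With 10 subjects per sequence, each treatment occurs equally often in
# each period, so treatment is orthogonal to period and the estimate is
# unchanged when period is dropped (only its standard error can differ).
coef(fit_full)["treatmentdecaf"]
coef(fit_simple)["treatmentdecaf"]
```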
Sequential Probability Ratio Test: Tinkering with Parameters. The SPRT is a powerful tool, but its performance depends on the choice of parameters. In this exercise, we will explore how changing the parameters \(p_0\), \(p_1\), \(\alpha\), and \(\beta\) affects the behavior of the test that was shown in class.¹
Using the same simulation of a single run that was shown in class, lower both error rates (\(\alpha\) and \(\beta\)) to represent a more stringent test, and plot the same simulated run with the new thresholds. How does the decision process change? Does it take more or fewer samples to reach a decision?
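The class code is not reproduced here, so the following is a self-contained sketch of one Bernoulli SPRT run with stringent error rates; the specific values of \(p_0\), \(p_1\), \(\alpha\), and \(\beta\) are illustrative choices, not prescribed ones:

```r
set.seed(1)
p0 <- 0.5; p1 <- 0.7          # hypothesized success probabilities
alpha <- 0.01; beta <- 0.01   # stringent error rates (vs. the usual 0.05)

# Wald's approximate stopping thresholds on the log-likelihood ratio.
upper <- log((1 - beta) / alpha)   # cross above -> decide for H1
lower <- log(beta / (1 - alpha))   # cross below -> decide for H0

x <- rbinom(300, 1, p1)            # data generated under H1
step <- ifelse(x == 1, log(p1 / p0), log((1 - p1) / (1 - p0)))
llr <- cumsum(step)

n_stop <- which(llr >= upper | llr <= lower)[1]
decision <- if (llr[n_stop] >= upper) "accept H1" else "accept H0"
c(n_stop = n_stop, decision = decision)

plot(llr[1:n_stop], type = "s", xlab = "sample", ylab = "log likelihood ratio")
abline(h = c(lower, upper), lty = 2)
```

Shrinking \(\alpha\) and \(\beta\) pushes the two thresholds farther from zero, so the random walk typically needs more samples before it exits the continuation region.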
Choose a fixed value for \(p_0\) (e.g., 0.5) and vary \(p_1\) (e.g., 0.6, 0.7, 0.8). For each value of \(p_1\), use a full simulation to calculate the expected number of samples needed to reach a decision under both hypotheses.
Now, fix \(p_1\) (e.g., 0.7) and vary \(p_0\) (e.g., 0.5, 0.4, 0.3). Again, calculate the expected number of samples needed for each scenario.
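A simulation along these lines might look like the following sketch; the grids of values, repetition count, and cap on the run length are all illustrative assumptions:

```r
set.seed(2)

# Average stopping time of a Bernoulli SPRT, estimated by simulation.
sprt_n <- function(p_true, p0, p1, alpha = 0.05, beta = 0.05,
                   reps = 500, max_n = 5000) {
  upper <- log((1 - beta) / alpha)
  lower <- log(beta / (1 - alpha))
  one_run <- function() {
    llr <- 0
    for (n in 1:max_n) {
      x <- rbinom(1, 1, p_true)
      llr <- llr + if (x == 1) log(p1 / p0) else log((1 - p1) / (1 - p0))
      if (llr >= upper || llr <= lower) return(n)
    }
    max_n
  }
  mean(replicate(reps, one_run()))
}

# Fix p0 = 0.5 and widen the gap to p1: decisions come faster
# (expected sample size here is computed under H1, i.e. p_true = p1).
en_by_p1 <- sapply(c(0.6, 0.7, 0.8),
                   function(p1) sprt_n(p_true = p1, p0 = 0.5, p1 = p1))
en_by_p1
```

The same function covers the other scenarios: fix `p1` and vary `p0`, and set `p_true = p0` to get the expected sample size under \(H_0\).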
Summarize your findings. How do the choices of \(p_0\) and \(p_1\) affect the efficiency of the SPRT? What trade-offs do you observe when adjusting the error rates \(\alpha\) and \(\beta\)?
Adaptive Assignment: Implementing UCB. A pharmaceutical company is running a clinical trial comparing two pain relief drugs. Each patient reports a pain relief score on a continuous scale from 0 to 1 (higher is better). The company wants to use an adaptive assignment strategy that balances learning which drug is better with assigning more patients to the better-performing drug.
The following code generates potential outcomes for \(N = 30\) subjects under both treatments.
```r
library(tidyverse)
set.seed(42)
N <- 30
po <- tibble(
  subject = 1:N,
  Y_A = rbeta(N, 4, 6),
  Y_B = rbeta(N, 5, 5)
)
```

Recall the UCB algorithm with Hoeffding’s Inequality. The upper confidence bound for arm \(j\) at time \(t\) is:
\[ U_j = \hat{\mu}_j + \sqrt{\frac{\ln(t)}{2\, n_j}} \]
where \(\hat{\mu}_j\) is the sample mean of responses observed so far for arm \(j\), \(n_j\) is the number of subjects assigned to arm \(j\) so far, and \(t\) is the index of the current subject. At each step, the algorithm assigns the next subject to the arm with the highest \(U_j\).
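Under these definitions, the UCB loop can be sketched as follows. The `po` tibble from the code above is regenerated so the snippet is self-contained, and the forced initial pull of each arm matches the initialization described in the first task below:

```r
library(tidyverse)

# Potential outcomes, as generated above.
set.seed(42)
N <- 30
po <- tibble(subject = 1:N,
             Y_A = rbeta(N, 4, 6),
             Y_B = rbeta(N, 5, 5))

arm <- character(N); y <- numeric(N)
arm[1] <- "A"; y[1] <- po$Y_A[1]   # forced first pulls
arm[2] <- "B"; y[2] <- po$Y_B[2]

for (t in 3:N) {
  U <- sapply(c("A", "B"), function(j) {
    obs <- y[which(arm[1:(t - 1)] == j)]
    mean(obs) + sqrt(log(t) / (2 * length(obs)))  # mean + Hoeffding bonus
  })
  arm[t] <- names(which.max(U))                   # pull arm with highest U_j
  y[t]   <- if (arm[t] == "A") po$Y_A[t] else po$Y_B[t]
}

trial <- tibble(subject = 1:N, assignment = arm, response = y) |>
  mutate(frac_B = cumsum(assignment == "B") / subject)  # n_B / t, for the plot
```

The bonus term shrinks as an arm accumulates observations, so a rarely sampled arm keeps a large \(U_j\) and eventually gets pulled again even if its sample mean is lower.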
- Write code to implement the UCB algorithm on these potential outcomes. Initialize by assigning the first subject to arm A and the second to arm B, then use the UCB rule for subjects 3 through 30². Store the results in a data frame with columns for the subject index, assignment, and observed response.
Create a plot of the cumulative fraction of subjects assigned to arm B (\(n_B / t\)) as a function of the subject index \(t\). Describe what you see: when is the algorithm exploring (assigning roughly evenly) and when is it exploiting (favoring one arm)?
For a single time step of your choosing between \(t = 5\) and \(t = 10\), report the values of \(\hat{\mu}_A\), \(\hat{\mu}_B\), the addition to the mean (aka the Hoeffding bonus), and the resulting \(U_j\). Verify that the algorithm’s assignment matches the arm with the higher \(U_j\). What role does the bonus term play when one arm has been sampled much less than the other?
Compute the total regret of the UCB algorithm on these 30 subjects and compare it with the expected total regret of a balanced completely randomized design that assigns 15 subjects to each arm³. The per-subject regret is \(\max(Y_i(A), Y_i(B)) - Y_i(d_i)\), where \(d_i\) is the arm actually assigned. Does the UCB algorithm achieve lower total regret than what we would expect had we used a balanced CR design?
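The CR benchmark can be simulated as the footnote suggests; this sketch assumes the same `po` potential outcomes, regenerated so the snippet runs on its own:

```r
library(tidyverse)

set.seed(42)
N <- 30
po <- tibble(subject = 1:N,
             Y_A = rbeta(N, 4, 6),
             Y_B = rbeta(N, 5, 5))

best <- pmax(po$Y_A, po$Y_B)   # best attainable response per subject

# Simulate many balanced assignments (15 per arm) and average the
# total regret across simulations.
cr_regret <- replicate(5000, {
  a <- sample(rep(c("A", "B"), each = N / 2))
  got <- ifelse(a == "A", po$Y_A, po$Y_B)
  sum(best - got)
})
mean_cr <- mean(cr_regret)
mean_cr
```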
Footnotes
¹ See the slides for Sequential Analysis I for code to simulate the performance of the SPRT.
² Start off by simply writing the code to calculate \(U_j\) after observing the first two subjects, and then assign the third subject accordingly. Next, wrap that logic in a loop to handle all subjects up to 30.
³ To compute the expected total regret of a balanced CR design, you can simulate many random assignments of 15 subjects to each arm, compute the total regret for each simulation, and then take the average.