Power Analysis Simulation Example
There are a lot of resources online about carrying out power analysis with tools such as G*Power for closed-form tests (e.g. most univariate/bivariate tests like the t-test or chi-square, and perhaps some simple regression), or with dedicated R packages. However, what I can't find online are good worked examples of power analysis by simulation for more complex models, such as mixed effects models. So here's one.
The Setup
Let's make up an experiment. Say we're studying whether participants are more likely to believe a statement spoken by someone older or younger. We bring participants into the lab and show them some videos - in half the videos the presenter is young, and in half the presenter is older. We'll also assume there's some external intervention that produces a treatment and control group; perhaps the treatment group is first primed on the laws about age bias.
To make this concrete we'll have the following design:
- 12 total videos (trial); each video has a unique presenter. Each participant sees all 12 videos, in randomized order.
- 6 of those videos have young presenters (condition = 0) and 6 have older presenters (condition = 1).
- We'll record participant age (age).
- Half the participants will be shown a video about age bias (z = 1) and the others will not (z = 0).
- We'll assume the time between watching each video and being asked about it (duration) is randomized, and somehow standardized.
- For each video, participants will either believe or disbelieve the statement in the video (y).
- Our primary goal will be determining whether there is an interaction between z and condition.
To account for between-participant bias, we'll fit a mixed effects logistic regression model:
$$
\text{logit}\big(\Pr(y_{ij} = 1)\big) = \beta_0 + \beta_z z_i + \beta_{\text{cond}}\,\text{condition}_j + \beta_{z\times\text{cond}}\, z_i\,\text{condition}_j + \beta_{\text{dur}}\,\text{duration}_{ij} + \beta_{\text{age}}\,\text{age}_i + u_i, \qquad u_i \sim N(0, \sigma_u^2)
$$

Here $i$ indexes the participant and $j$ the trial, so we have variables at all levels: participant level (age), trial level (condition), and participant-trial level (duration, and obviously y). The term $u_i$ is the random intercept for participant $i$ (their overall likelihood of believing a statement), with standard deviation $\sigma_u$ (random_int_sd in the code below).
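For reference, in lme4 syntax this is the model we'll fit later on; the formula maps one-to-one onto the equation above (fixed effects for condition, z, their interaction, duration, and age, plus a random intercept per participant). It's shown here only as a preview, since it needs the simulated data set (simdata) built below:

library(lme4)
model <- glmer(y ~ condition * z + duration + age + (1 | id),
               family = binomial, data = simdata)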
Simulating Data
First, here's some code to generate a simulated data set. Note that "proper" R programming would be to pass the parameters into the function as arguments (and in all other cases I strongly suggest you do that), but in a simulation like this it's often nice to just define them at the top of the script.
How did I come up with these values? This is where power analysis becomes impossible - they're guesses. Simple as that. Sometimes you may have some informed guesses (previous results, pilot studies, common sense) but still guesses. And if your guesses are wrong (and they are ALWAYS wrong), then your power analysis is wrong. But hopefully they're close enough to the "truth" to be useful.
n_subj <- 100
n_trial <- 12
beta_intercept <- -1
beta_z <- .4
beta_condition <- .8
beta_z_condition <- .4
beta_duration <- .3
beta_age <- .002
random_int_sd <- .5

simulate <- function() {
  simdata <- data.frame(id = rep(seq_len(n_subj), each = n_trial),
                        trial = rep(seq_len(n_trial), times = n_subj),
                        z = rep(0:1, each = (n_subj/2)*n_trial))
  ## trial specific condition
  simdata$condition <- rep(rep(0:1, each = 6), times = n_subj)
  ## subject specific age
  age <- round(runif(n_subj, 18, 50))
  simdata$age <- rep(age, each = n_trial)
  ## subject-trial specific duration
  simdata$duration <- rnorm(n_trial*n_subj, .5, .1)
  simdata$duration[simdata$duration < .1] <- .1 # just in case we get any extreme values
  ## subject specific random intercept
  simdata$random_intercept <- rep(rnorm(n_subj, 0, random_int_sd), each = n_trial)
  ## linear predictor on the logit scale
  simdata$logit <- with(simdata,
                        beta_intercept +
                        beta_z * z +
                        beta_condition * condition +
                        beta_z_condition * condition * z +
                        beta_duration * duration +
                        beta_age * age +
                        random_intercept)  # include the participant random intercept
  simdata$p <- 1 / (1 + exp(-simdata$logit))
  simdata$y <- rbinom(n_subj * n_trial, 1, simdata$p)
  return(simdata)
}
Now, we can look at a simulated data set:
> head(simulate(), 12)
id trial z condition age duration random_intercept logit p y
1 1 1 0 0 33 0.4043053 -0.4704536 -0.812708414 0.3073136 1
2 1 2 0 0 33 0.5258304 -0.4704536 -0.776250893 0.3151285 0
3 1 3 0 0 33 0.6370943 -0.4704536 -0.742871706 0.3223765 1
4 1 4 0 0 33 0.6574723 -0.4704536 -0.736758324 0.3237134 0
5 1 5 0 0 33 0.6612624 -0.4704536 -0.735621287 0.3239624 1
6 1 6 0 0 33 0.4414852 -0.4704536 -0.801554431 0.3096931 1
7 1 7 0 1 33 0.3395555 -0.4704536 -0.032133336 0.4919674 1
8 1 8 0 1 33 0.5837512 -0.4704536 0.041125369 0.5102799 1
9 1 9 0 1 33 0.4559606 -0.4704536 0.002788174 0.5006970 0
10 1 10 0 1 33 0.5712586 -0.4704536 0.037377583 0.5093433 0
11 1 11 0 1 33 0.4007802 -0.4704536 -0.013765941 0.4965586 1
12 1 12 0 1 33 0.5567252 -0.4704536 0.033017569 0.5082536 0
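Before fitting anything, it's worth a quick sanity check that these guessed coefficients imply believable cell-level rates. Here's a minimal sketch using only base R; the duration of 0.5 and age of 35 are just illustrative "typical" values I picked, not part of the design:

simdata <- simulate()
## Observed believe rate in each z-by-condition cell of one simulated data set;
## with the guesses above these should land very roughly between 0.3 and 0.7
with(simdata, tapply(y, list(z = z, condition = condition), mean))
## Implied probability for, say, a 35-year-old control participant (z = 0)
## viewing an older presenter (condition = 1) at a duration of 0.5:
plogis(beta_intercept + beta_condition + beta_duration * 0.5 + beta_age * 35)

If those numbers look implausible for your setting, adjust the betas now rather than after an afternoon of simulation.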
Run the Model
Next, we need to run the model against the simulated data to obtain the p-value.
library(lme4)  # provides glmer()

fit_model <- function() {
  simdata <- simulate()
  ## Fit model
  tryCatch({
    model <- glmer(y ~ condition * z + duration + age + (1 | id),
                   family = binomial,
                   data = simdata)
    ## Extract p-value for the interaction fixed effect
    fixed_effects <- summary(model)$coefficients
    p_value <- fixed_effects["condition:z", "Pr(>|z|)"]
    return(p_value)
  }, error = function(e) {
    return(NA)
  })
}
Note the use of tryCatch there. It says "if this code runs, great, just do it; if it crashes, run the error = bit instead, which returns NA." Otherwise, an error would stop your simulation prematurely.
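If the pattern is new to you, here's a tiny self-contained illustration of the behaviour; nothing about it is specific to glmer:

tryCatch(stop("boom"), error = function(e) NA)  # error is caught, NA is returned
tryCatch(1 + 1, error = function(e) NA)         # no error, so we get 2 as usual

One caveat: lme4 convergence problems usually arrive as warnings rather than errors, so those fits still return a p-value; whether to discard them as well is a judgment call. A few single runs of the real fit_model() look like this: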
> fit_model()
[1] 0.06101131
> fit_model()
[1] 0.3603404
> fit_model()
[1] 0.1031708
Run the Simulation
Finally, we just run this some large number of times and calculate the percentage of significant p-values - that's your power. (This will be slow!)
> result <- replicate(1000, fit_model())
> cat(paste("\nEstimated power:", round(mean(result < .05, na.rm = TRUE), 2), "\n"))
Estimated power: 0.34
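If those 1,000 refits are painfully slow, note that the replications are independent, so they parallelize trivially. Here's a rough sketch using the base parallel package; it assumes a Unix-alike since mclapply relies on forking (on Windows you'd use parLapply with a cluster instead), and 4 cores is just an example:

library(parallel)
RNGkind("L'Ecuyer-CMRG")  # RNG designed to behave well across forked workers
result <- unlist(mclapply(seq_len(1000), function(i) fit_model(), mc.cores = 4))
cat(paste("\nEstimated power:", round(mean(result < .05, na.rm = TRUE), 2), "\n"))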
Important: do not use set.seed until you are ready to submit your work! The point of a simulation is the randomness; if you set a seed, you break that randomness and are no longer protected by it. Instead, run the entire procedure 3-4 times to make sure you get consistent results; if not, increase the number of simulations.
Also, remember, this is the power for the interaction term only. If a different effect is your primary interest, change which coefficient's p-value is extracted in fit_model() and re-run it.
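Finally, since the real question is usually "how many participants do I need?", the same machinery extends naturally to a power curve over candidate sample sizes. Here's a rough sketch that simply overwrites the global n_subj that simulate() reads; power_at_n is a hypothetical helper of mine, and the sample sizes and 500 replications per point are arbitrary (the cleaner version passes the parameters in as arguments, as mentioned earlier):

power_at_n <- function(n, nsim = 500) {
  n_subj <<- n  # overwrite the global used by simulate(); must be even here
  p <- replicate(nsim, fit_model())
  mean(p < .05, na.rm = TRUE)
}
sapply(c(50, 100, 150, 200), power_at_n)  # estimated power at each sample size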