7 Multiple Imputation
Multiple imputation is a common approach to addressing missing data issues. When there is missing data, results are often obtained via complete case analysis (using only observations with complete data), which can produce biased results, though not always. Additionally, complete case analysis can have a severe negative effect on power by greatly reducing the sample size.
Imputation in general is the idea of filling in missing values to simulate having complete data. Some simpler forms of imputation include:
- Mean imputation. Replace each missing value with the mean of the variable for all non-missing observations (see the sketch after this list).
- Hot deck imputation. Replace each missing value with the value from another observation which is similar to the one with the missing value.
- Regression imputation. Fit a regression model and replace each missing value with its predicted value.
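As a concrete example of the first approach, mean imputation is essentially a one-liner in Stata. A minimal sketch, assuming a continuous variable bmi with some missing values:

* Mean imputation: fill in missing bmi values with the observed mean
summarize bmi, meanonly
replace bmi = r(mean) if missing(bmi)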
There are various pros and cons to each approach, but in general, none are as powerful or as commonly used as multiple imputation. Multiple imputation (or MI) is a three-step procedure:
- For each missing value, obtain a distribution for it. Sample from these distributions to obtain imputed values that have some randomness built in. Do this repeatedly to create \(M\) total imputed data sets. Each of these \(M\) data sets is identical on non-missing values but will (almost certainly) differ on the imputed values.
- Perform your statistical analysis on each of the \(M\) imputed data sets separately.
- Pool your results together in a specific fashion to account for the uncertainty in imputations.
Thankfully, for simple analyses (e.g. most regression models), Stata will perform all three steps for you automatically. We will briefly discuss later how to perform MI if Stata doesn’t support it.
7.1 Missing at random
There can be many causes of missing data. We can classify the reason data is missing into one of three categories:
- Missing completely at random (MCAR): This is missingness that is truly random - there is no cause of the missingness, it’s just due to chance. For example, you’re entering paper surveys into a spreadsheet and spill coffee on them, obscuring a few answers.
- Missing at random (MAR): The missingness here is due to observed data but not unobserved data. For example, women may be less likely to report their age, regardless of what their actual age is.
- Missing not at random (MNAR): Here the missingness is due to the missing value. For example, individuals with higher salary may be less willing to answer survey questions about their salary.
There is no statistical test¹ to distinguish between these categories; instead you must use your knowledge of the data and its collection to argue which category it falls under.
This is important because most imputation methods (including MI) require the data to be MCAR or MAR. If the data is MNAR, there is very little you can do. Generally, if you believe the data is MNAR, you can proceed under the MAR assumption, but discuss as a severe limitation of your analysis that the MAR assumption is likely invalid.
7.2 mi
The mi set of commands in Stata performs the steps of multiple imputation. There are three steps, with a preliminary step to examine the missingness. We'll be using the "mheart5" data from Stata's website, which has some missing data.
. webuse mheart5, clear
(Fictional heart attack data)

. describe, short

Contains data from https://www.stata-press.com/data/r18/mheart5.dta
 Observations:           154                  Fictional heart attack data
    Variables:             6                  19 Jun 2022 10:50
Sorted by:
. summarize

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      attack |        154    .4480519    .4989166          0          1
      smokes |        154    .4155844    .4944304          0          1
         age |        142    56.43324    11.59131   20.73613   83.78423
         bmi |        126    25.23523    4.029325   17.22643   38.24214
      female |        154    .2467532    .4325285          0          1
      hsgrad |        154    .7532468    .4325285          0          1
We see from the summary that both age and bmi have some missing data.
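If you want output focused specifically on the missingness, Stata's built-in misstable command summarizes it directly (an optional aside; it isn't part of the mi workflow below):

misstable summarize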
7.2.1 Setting up data
We need to tell Stata how we're going to be doing the imputations. First, use the mi set command to determine how the multiple data sets will be stored. Which option you choose is really up to you; I prefer the "flong" option, where the imputed data sets are stacked on top of each other. If you have very large data, you might prefer "wide", "mlong", or "mlongsep", the last of which stores each imputed data set in a separate file. See help mi styles for more details. (Ultimately the decision is not that important, as you can switch later using mi convert <new style>.)
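For instance, once the data has been mi set, switching to the wide style would look like the following (the clear option lets mi convert proceed even though the data in memory hasn't been saved):

mi convert wide, clear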
. mi set flong
Next, we need to tell Stata what each variable will be used for. The options are
- imputed: A variable with missing data that needs to be imputed.
- regular: Any variable that is complete or does not need imputation.
Technically we need only specify the imputed variables, as anything unspecified is assumed to be regular. We saw above that age and bmi have missing values:
. mi register imputed age bmi
(28 m=0 obs now marked as incomplete)
We can examine our setup with mi describe:
. mi describe

Style: flong
       last mi update 17aug2023 08:49:00, 0 seconds ago

Observations:
   Complete          126
   Incomplete         28  (M = 0 imputations)
   ---------------------
   Total             154

Variables:
   Imputed: 2; age(12) bmi(28)
   Passive: 0
   Regular: 0
   System:  3; _mi_m _mi_id _mi_miss
   (there are 4 unregistered variables; attack smokes female hsgrad)
We see 126 complete observations with 28 incomplete, the two variables to be imputed, and the 4 unregistered variables which will automatically be registered as regular.
7.2.1.1 Imputing transformations
What happens if you had a transform of a variable? Say you had a variable for salary, and wanted to use a log transformation?
You can find literature suggesting either transforming first and then imputing, or imputing first and then transforming. Our suggestion, following the current statistical literature (von Hippel 2009), is to transform first and impute second.
Stata technically supports the other option via mi register passive, but we don't recommend its usage. Instead, transform your original data, then flag both the variable and its transformations as "imputed".
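A minimal sketch of this, using a hypothetical salary variable with missing values:

* salary is a hypothetical variable; create its transformation first
generate log_salary = log(salary)
* then register both the original and transformed versions as imputed
mi register imputed salary log_salary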
7.2.2 Performing the imputation
Now that we've got the MI set up, we can perform the actual procedure. There are a wide variety of ways this imputation can be done (including defining your own!). You can see these as the options to mi impute. We'll just be focusing on the "chained" approach, which is a good approach to start with.
The syntax for this is a bit complicated, but straightforward once you understand it.
mi impute chained (<method 1>) <variables to impute with method 1> ///
                  (<method 2>) <variables to impute with method 2> ///
                  = <all non-imputed variables>, add(<number of imputations>)
The <methods> are essentially what type of model you would use to predict the outcome. For example, for continuous data, use regress. For binary data, use logit. It also supports ologit (ordinal logistic regression: multiple categories with ordering), mlogit (multinomial logistic regression: multiple categories without ordering), and poisson or nbreg (Poisson regression or negative binomial regression, for count data), as well as some others. See help mi impute chained under "uvmethod" for the full list.
The add() option specifies how many imputed data sets to generate; we'll discuss below how to choose this.
Continuing with our example might make this more clear. To perform our imputation, we would use
. mi impute chained (regress) bmi age = attack smokes female hsgrad, add(5)

note: missing-value pattern is monotone; no iteration performed.

Conditional models (monotone):
              age: regress age attack smokes female hsgrad
              bmi: regress bmi age attack smokes female hsgrad

Performing chained iterations ...

Multivariate imputation                     Imputations =        5
Chained equations                                 added =        5
Imputed: m=1 through m=5                        updated =        0

Initialization: monotone                     Iterations =        0
                                                burn-in =        0

             bmi: linear regression
             age: linear regression

------------------------------------------------------------------
                   |               Observations per m
                   |----------------------------------------------
          Variable |   Complete   Incomplete   Imputed |     Total
-------------------+-----------------------------------+----------
               bmi |        126           28        28 |       154
               age |        142           12        12 |       154
------------------------------------------------------------------
(Complete + Incomplete = Total; Imputed is the minimum across m
 of the number of filled-in observations.)
Since both bmi and age are continuous variables, we use method regress. Imagine if we were also imputing smokes, a binary variable. Then the imputation (after running mi register imputed smokes) would be:
mi impute chained (regress) bmi age (logit) smokes = attack female hsgrad, add(5)
Here, regress was used for bmi and age, and logit was used for smokes.
7.2.2.1 Choosing the number of imputations
Classic literature has suggested you need only 5 imputations to obtain valid results. This addresses the efficiency of point estimates, but not standard errors. More modern literature increases this number, with a good starting point being 200 imputations (Graham, Olchowski, and Gilreath 2007; White, Royston, and Wood 2011).
If your data set is large and the imputation is slow, a more recent paper (von Hippel 2020) gives a two-stage procedure to estimate the required number of imputations. This procedure first performs a small number of imputations and carries out the analysis, then uses the results of that analysis to inform a better estimate of the required number of imputations. You can install the user command how_many_imputations for details and examples:
ssc install how_many_imputations
help how_many_imputations
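A rough sketch of the two-stage workflow (see the help file for authoritative usage; this assumes how_many_imputations is run immediately after mi estimate):

* Stage 1: a small pilot number of imputations plus the analysis of interest
mi impute chained (regress) bmi age = attack smokes female hsgrad, add(5)
mi estimate: logit attack smokes age bmi female hsgrad
* Stage 2: uses the fraction of missing information from the pilot analysis
* to suggest how many total imputations you need
how_many_imputations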
7.2.2.2 Variables created by mi
After you've performed your imputation², three new variables are added to your data, and your data gets \(M\) additional copies of itself. In the example above, we added 5 imputations, so there are a total of 6 copies of the data - the raw data (with the missing values), and 5 copies with imputed values. The new variables added are:
- _mi_id is the ID number of each row, corresponding to its position in the original data.
- _mi_miss flags whether the row had missing data originally.
- _mi_m indicates which data set we're looking at: 0 represents the unimputed data, 1 represents the first imputation, 2 the second, etc.
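You can see this structure directly; for example, tabulating _mi_m on the data above should show 154 rows in each of the 6 copies:

* each value of _mi_m (0 through 5) should appear 154 times
tab _mi_m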
7.2.3 Analyzing mi data
Now that we’ve got the data set up for multiple imputations, and done the imputation, most of the hard part is over. Analyzing MI data is straightforward, usually. (When it isn’t, you can do this manually.)
Basically, take any analysis command you would normally run, e.g. regress y x, and preface it with mi estimate:. Let's try to predict the odds of a heart attack based upon other characteristics in the data. We would run a logistic regression model,
logit attack smokes age bmi female hsgrad
So to run it with multiple imputations:
. mi estimate: logit attack smokes age bmi female hsgrad

Multiple-imputation estimates                   Imputations       =          5
Logistic regression                             Number of obs     =        154
                                                Average RVI       =     0.0966
                                                Largest FMI       =     0.2750
DF adjustment:   Large sample                   DF:     min       =      62.83
                                                        avg       =  53,215.09
                                                        max       = 146,351.98
Model F test:       Equal FMI                   F(   5, 1243.8)   =       2.90
Within VCE type:          OIM                   Prob > F          =     0.0130

------------------------------------------------------------------------------
      attack | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      smokes |   1.163433    .352684     3.30   0.001       .47217    1.854695
         age |   .0284627   .0164787     1.73   0.086    -.0040684    .0609938
         bmi |   .0800942   .0491285     1.63   0.108    -.0180864    .1782749
      female |  -.0970499   .4091373    -0.24   0.812    -.8989527    .7048528
      hsgrad |     .10968   .3991282     0.27   0.783    -.6726034    .8919634
       _cons |  -4.390356   1.598513    -2.75   0.006    -7.531833   -1.248878
------------------------------------------------------------------------------
We see a single model, even though 5 models (one for each imputation) were run in the background. The results from these models were pooled using something called “Rubin’s rules” to produce a single model output.
We see a few additional fit summaries about the multiple imputation that aren't super relevant, but otherwise all the existing interpretations hold. Note that an \(F\)-test is run instead of a \(\chi^2\) test, but it still tests the same hypothesis that all coefficients are identically zero. Among the coefficients, we see that smokers have significantly higher odds of having a heart attack, and there's some weak evidence that age plays a role.
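Since the interpretation above is in terms of odds, note that mi estimate accepts eform reporting options such as or, so you can request odds ratios directly:

mi estimate, or: logit attack smokes age bmi female hsgrad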
7.2.3.1 MI Postestimation
In general, most postestimation commands will not work after MI. The general approach is to do the MI manually and run the postestimation command on each imputation. One exception is that mi predict works how predict does.
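mi predict needs the individual models' estimation results, so save them when fitting and then predict. A minimal sketch, where miest is an arbitrary name for the file of saved results and xb_attack is a new variable holding the pooled linear predictions:

* save each imputation's estimation results to miest.ster, then predict
mi estimate, saving(miest): logit attack smokes age bmi female hsgrad
mi predict xb_attack using miest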
7.3 Manual MI
Since we set the data as flong, each imputed data set lives in the data with a separate _mi_m value. You can conditionally run analyses on each, e.g.
logit attack smokes age bmi female hsgrad if _mi_m == 0
to run the model on only the original data.
It is tedious to do this over all imputed data sets, so instead we can use mi xeq: as a prefix to run a command on each separate data set. This is similar to mi estimate: except without the pooling.
. mi xeq: summ age

m=0 data:
-> summ age

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         age |        142    56.43324    11.59131   20.73613   83.78423

m=1 data:
-> summ age

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         age |        154    56.20732    11.61166   20.73613   83.78423

m=2 data:
-> summ age

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         age |        154    55.79566    11.88629    16.9347   83.78423

m=3 data:
-> summ age

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         age |        154    56.35074    11.50551   20.73613   83.78423

m=4 data:
-> summ age

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         age |        154    56.35633     11.8424   20.73613   86.11715

m=5 data:
-> summ age

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         age |        154    56.40651    11.44234   20.73613   83.78423
This can also be useful if the analysis you want to execute is not yet supported by mi estimate.
7.3.1 Rubin’s rules
If you wanted to pool the results yourself, you can obtain the pooled point estimate as the simple average of the estimates across imputations. The formula for the variance is slightly more complicated, so we don't reproduce it here; it can be found in the "Methods and formulas" section of the MI manual (run help mi estimate, then click on "[MI] mi estimate" at the top of the file to open the manual).
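As an illustration, a minimal sketch of pooling a point estimate by hand in the flong data above, averaging the smokes coefficient over the 5 imputations (the pooled variance requires the full formula, which this skips):

* fit the model on each imputed data set and average one coefficient
scalar pooled = 0
forvalues m = 1/5 {
    quietly logit attack smokes age bmi female hsgrad if _mi_m == `m'
    scalar pooled = pooled + _b[smokes]
}
display "Pooled point estimate for smokes: " pooled/5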
7.4 Removing the MI data
Ideally, you should save the data (or preserve it) prior to imputing, so you can easily recover the unimputed data if you wish. If you want to return to the original data, the following should work:
mi unset
drop if mi_m != 0
drop mi_*
The first tells Stata not to treat it as imputed anymore; the second drops all imputed data sets; the third removes the MI variables that were generated.
This only works for mi set flong; if you used another style, you can tweak the above or use mi convert flong to switch to "flong" first.
7.5 Survey and multiple imputation
Just a quick note: if you want to utilize a complex survey design and multiple imputation simultaneously, the commands need to be issued in the proper order.
mi set ...
mi svyset ...
mi impute ... [pweight = weight]
mi estimate: svy: regress ...
Survey weights must be used in the imputations (Reist and Larsen 2012), hence the [pweight = ...] part of the mi impute command; note that of the survey design, only the weights play a role in the imputation step. Estimation commands still need both the mi estimate: and svy: prefixes, in that order.
¹ There is technically Little's MCAR test to compare MCAR vs MAR, but the majority of imputation methods require only MAR, not MCAR, so it's of limited use. Additionally, it is not yet supported in Stata.

² Technically this happens as soon as you run mi set, but they're not interesting until after mi impute.