7 Multiple Imputation
Multiple imputation is a common approach to addressing missing data issues. When there is missing data, results are often obtained via complete case analysis (using only observations with complete data), which can produce biased results, though not always. Additionally, complete case analysis can have a severe negative effect on power by greatly reducing the sample size.
Imputation in general is the idea of filling in missing values to simulate having complete data. Some simpler forms of imputation include:
- Mean imputation. Replace each missing value with the mean of the variable for all non-missing observations (see the sketch after this list).
- Hot deck imputation. Replace each missing value with the value from another observation which is similar to the one with the missing value.
- Regression imputation. Fit a regression model and replace each missing value with its predicted value.
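As a concrete example of the first approach, mean imputation is essentially a one-liner in Stata. A minimal sketch, assuming a continuous variable bmi with some missing values:

* Mean imputation: fill in missing bmi values with the observed mean
summarize bmi, meanonly
replace bmi = r(mean) if missing(bmi)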
There are various pros and cons to each approach, but in general, none are as powerful or as commonly used as multiple imputation. Multiple imputation (or MI) is a three-step procedure:
- For each missing value, obtain a distribution for it. Sample from these distributions to obtain imputed values that have some randomness built in. Do this repeatedly to create \(M\) total imputed data sets. Each of these \(M\) data sets is identical on non-missing values but will (almost certainly) differ on the imputed values.
- Perform your statistical analysis on each of the \(M\) imputed data sets separately.
- Pool your results together in a specific fashion to account for the uncertainty in imputations.
Thankfully, for simple analyses (e.g. most regression models), Stata will perform all three steps for you automatically. We will briefly discuss later how to perform MI if Stata doesn’t support it.
7.1 Missing at random
There can be many causes of missing data. We can classify the reason data is missing into one of three categories:
- Missing completely at random (MCAR): This is missingness that is truly random - there is no cause of the missingness, it’s just due to chance. For example, you’re entering paper surveys into a spreadsheet and spill coffee on them, obscuring a few answers.
- Missing at random (MAR): The missingness here is due to observed data but not unobserved data. For example, women may be less likely to report their age, regardless of what their actual age is.
- Missing not at random (MNAR): Here the missingness is due to the missing value. For example, individuals with higher salary may be less willing to answer survey questions about their salary.
There is no statistical test¹ to distinguish between these categories; instead you must use your knowledge of the data and its collection to argue which category it falls under.
This is important because most imputation methods (including MI) require the data to be MCAR or MAR. If the data is MNAR, there is very little you can do. Generally, if you believe the data is MNAR, you can proceed under the MAR assumption, but discuss as a severe limitation of your analysis that the MAR assumption is likely invalid.
7.2 mi
The mi set of commands in Stata performs the steps of multiple imputation. There are three steps, with a preliminary step to examine the missingness. We'll be using the "mheart5" data from Stata's website, which has some missing data.
. webuse mheart5, clear
(Fictional heart attack data)

. describe, short

Contains data from https://www.stata-press.com/data/r18/mheart5.dta
 Observations:           154                  Fictional heart attack data
    Variables:             6                  19 Jun 2022 10:50
Sorted by:
. summarize

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      attack |        154    .4480519    .4989166          0          1
      smokes |        154    .4155844    .4944304          0          1
         age |        142    56.43324    11.59131   20.73613   83.78423
         bmi |        126    25.23523    4.029325   17.22643   38.24214
      female |        154    .2467532    .4325285          0          1
      hsgrad |        154    .7532468    .4325285          0          1
We see from the summary that both age and bmi have some missing data.
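If you want output focused specifically on the missingness, Stata's built-in misstable command summarizes it directly (an optional aside; it isn't part of the mi workflow below):

misstable summarize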
7.2.1 Setting up data
We need to tell Stata how we're going to be doing the imputations. First, use the mi set command to determine how the multiple data sets will be stored. Which option you choose is really up to you; I prefer the "flong" option, where the imputed data sets are stacked on top of each other. If you have very large data, you might prefer "wide", "mlong", or "mlongsep", the last of which stores each imputed data set in a separate file. See help mi styles for more details. (Ultimately the decision is not that important, as you can switch later using mi convert <new style>.)
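For instance, once the data has been mi set, switching to the wide style would look like the following (the clear option lets mi convert proceed even though the data in memory hasn't been saved):

mi convert wide, clear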
. mi set flong
Next, we need to tell Stata what each variable will be used for. The options are
- imputed: A variable with missing data that needs to be imputed.
- regular: Any variable that is complete or does not need imputation.
Technically we need only specify the imputed variables, as anything unspecified is assumed to be regular. We saw above that age and bmi have missing values:
. mi register imputed age bmi
(28 m=0 obs now marked as incomplete)
We can examine our setup with mi describe:
. mi describe

Style: flong
       last mi update 17aug2023 08:49:00, 0 seconds ago

Observations:
   Complete          126
   Incomplete         28  (M = 0 imputations)
   ---------------------
   Total             154

Variables:
   Imputed: 2; age(12) bmi(28)
   Passive: 0
   Regular: 0
   System:  3; _mi_m _mi_id _mi_miss
   (there are 4 unregistered variables; attack smokes female hsgrad)
We see 126 complete observations with 28 incomplete, the two variables to be imputed, and the 4 unregistered variables which will automatically be registered as regular.
7.2.1.1 Imputing transformations
What happens if you had a transform of a variable? Say you had a variable for salary, and wanted to use a log transformation?
You can find literature suggesting either transforming first and then imputing, or imputing first and then transforming. Our suggestion, following the current statistical literature (von Hippel 2009), is to transform first and impute second.
Stata technically supports the other option via mi register passive, but we don't recommend its usage. Instead, transform your original data, then flag both the variable and its transformations as "imputed".
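A minimal sketch of this, using a hypothetical salary variable with missing values:

* salary is a hypothetical variable; create its transformation first
generate log_salary = log(salary)
* then register both the original and transformed versions as imputed
mi register imputed salary log_salary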
7.2.2 Performing the imputation
Now that we've got the MI set up, we can perform the actual procedure. There are a wide variety of ways this imputation can be done (including defining your own!). You can see these as the options to mi impute. We'll just be focusing on the "chained" approach, which is a good approach to start with.
The syntax for this is a bit complicated, but straightforward once you understand it.
mi impute chained (<method 1>) <variables to impute with method 1> ///
                  (<method 2>) <variables to impute with method 2> ///
                  = <all non-imputed variables>, add(<number of imputations>)
The <methods> are essentially what type of model you would use to predict the outcome. For example, for continuous data, use regress. For binary data, use logit. It also supports ologit (ordinal logistic regression: multiple categories with ordering), mlogit (multinomial logistic regression: multiple categories without ordering), and poisson or nbreg (Poisson regression or negative binomial regression, for count data), as well as some others. See help mi impute chained under "uvmethod" for the full list.
The add() option specifies how many imputed data sets to generate; we'll discuss below how to choose this.
Continuing with our example might make this more clear. To perform our imputation, we would use
. mi impute chained (regress) bmi age = attack smokes female hsgrad, add(5)

note: missing-value pattern is monotone; no iteration performed.

Conditional models (monotone):
              age: regress age attack smokes female hsgrad
              bmi: regress bmi age attack smokes female hsgrad

Performing chained iterations ...

Multivariate imputation                     Imputations =        5
Chained equations                                 added =        5
Imputed: m=1 through m=5                        updated =        0

Initialization: monotone                     Iterations =        0
                                                burn-in =        0

             bmi: linear regression
             age: linear regression

------------------------------------------------------------------
                   |               Observations per m
                   |----------------------------------------------
          Variable |   Complete   Incomplete   Imputed |     Total
-------------------+-----------------------------------+----------
               bmi |        126           28        28 |       154
               age |        142           12        12 |       154
------------------------------------------------------------------
(Complete + Incomplete = Total; Imputed is the minimum across m
 of the number of filled-in observations.)
Since both bmi and age are continuous variables, we use method regress. Imagine if we were also imputing smokes, a binary variable. Then the imputation (after running mi register imputed smokes) would be:
mi impute chained (regress) bmi age (logit) smokes = attack female hsgrad, add(5)
Here, regress was used for bmi and age, and logit was used for smokes.
7.2.2.1 Choosing the number of imputations
Classic literature has suggested you need only 5 imputations to obtain valid results. This addresses the efficiency of point estimates, but not standard errors. More modern literature increases this number, with a good starting point being 200 imputations (Graham, Olchowski, and Gilreath 2007; White, Royston, and Wood 2011).
If your data set is large and the imputation is slow, a more recent paper (von Hippel 2020) gives a two-stage procedure to estimate the required number of imputations. This procedure first performs a small number of imputations and carries out the analysis, then uses the results of that analysis to inform a better estimate of the required number of imputations. You can install the user command how_many_imputations for details and examples:
ssc install how_many_imputations
help how_many_imputations
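A rough sketch of the two-stage workflow (see the help file for authoritative usage; this assumes how_many_imputations is run immediately after mi estimate):

* Stage 1: a small pilot number of imputations plus the analysis of interest
mi impute chained (regress) bmi age = attack smokes female hsgrad, add(5)
mi estimate: logit attack smokes age bmi female hsgrad
* Stage 2: uses the fraction of missing information from the pilot analysis
* to suggest how many total imputations you need
how_many_imputations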
7.2.2.2 Variables created by mi
After you've performed your imputation², three new variables are added to your data, and your data gets \(M\) additional copies of itself. In the example above, we added 5 imputations, so there are a total of 6 copies of the data - the raw data (with the missing values), and 5 copies with imputed values. The new variables added are:
- _mi_id is the ID number of each row, corresponding to its position in the original data.
- _mi_miss flags whether the row had missing data originally.
- _mi_m indicates which data set we're looking at: 0 represents the unimputed data, 1 represents the first imputation, 2 the second, etc.
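You can see this structure directly; for example, tabulating _mi_m on the data above should show 154 rows in each of the 6 copies:

* each value of _mi_m (0 through 5) should appear 154 times
tab _mi_m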
7.2.3 Analyzing mi data
Now that we’ve got the data set up for multiple imputations, and done the imputation, most of the hard part is over. Analyzing MI data is straightforward, usually. (When it isn’t, you can do this manually.)
Basically, take any analysis command you would normally run, e.g. regress y x, and preface it with mi estimate:. Let's try to predict the odds of a heart attack based upon other characteristics in the data. We would run a logistic regression model,
logit attack smokes age bmi female hsgrad
So to run it with multiple imputations:
. mi estimate: logit attack smokes age bmi female hsgrad

Multiple-imputation estimates                   Imputations       =          5
Logistic regression                             Number of obs     =        154
                                                Average RVI       =     0.0966
                                                Largest FMI       =     0.2750
DF adjustment:   Large sample                   DF:     min       =      62.83
                                                        avg       =  53,215.09
                                                        max       = 146,351.98
Model F test:       Equal FMI                   F(   5, 1243.8)   =       2.90
Within VCE type:          OIM                   Prob > F          =     0.0130

------------------------------------------------------------------------------
      attack | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      smokes |   1.163433    .352684     3.30   0.001       .47217    1.854695
         age |   .0284627   .0164787     1.73   0.086    -.0040684    .0609938
         bmi |   .0800942   .0491285     1.63   0.108    -.0180864    .1782749
      female |  -.0970499   .4091373    -0.24   0.812    -.8989527    .7048528
      hsgrad |     .10968   .3991282     0.27   0.783    -.6726034    .8919634
       _cons |  -4.390356   1.598513    -2.75   0.006    -7.531833   -1.248878
------------------------------------------------------------------------------
We see a single model, even though 5 models (one for each imputation) were run in the background. The results from these models were pooled using something called “Rubin’s rules” to produce a single model output.
We see a few additional fit summaries about the multiple imputation that aren't super relevant, but otherwise all the existing interpretations hold. Note that an \(F\)-test is run instead of a \(\chi^2\) test, but it still tests the same hypothesis that all coefficients are identically zero. Among the coefficients, we see that smokers have significantly higher odds of having a heart attack, and there's some weak evidence that age plays a role.
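Since the interpretation above is in terms of odds, note that mi estimate accepts eform reporting options such as or, so you can request odds ratios directly:

mi estimate, or: logit attack smokes age bmi female hsgrad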
7.2.3.1 MI Postestimation
In general, most postestimation commands will not work after MI. The general approach is to do the MI manually and run the postestimation command on each imputation. One exception is that mi predict works how predict does.
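mi predict needs the individual models' estimation results, so save them when fitting and then predict. A minimal sketch, where miest is an arbitrary name for the file of saved results and xb_attack is a new variable holding the pooled linear predictions:

* save each imputation's estimation results to miest.ster, then predict
mi estimate, saving(miest): logit attack smokes age bmi female hsgrad
mi predict xb_attack using miest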
7.3 Manual MI
Since we set the data as flong, each imputed data set lives in the data with a separate _mi_m value. You can conditionally run analyses on each, e.g.
logit attack smokes age bmi female hsgrad if _mi_m == 0
to run the model on only the original data.
It is tedious to do this over all imputed data sets, so instead we can use mi xeq: as a prefix to run a command on each separate data set. This is similar to mi estimate: except without the pooling.
. mi xeq: summ age

m=0 data:
-> summ age

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         age |        142    56.43324    11.59131   20.73613   83.78423

m=1 data:
-> summ age

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         age |        154    56.20732    11.61166   20.73613   83.78423

m=2 data:
-> summ age

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         age |        154    55.79566    11.88629    16.9347   83.78423

m=3 data:
-> summ age

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         age |        154    56.35074    11.50551   20.73613   83.78423

m=4 data:
-> summ age

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         age |        154    56.35633     11.8424   20.73613   86.11715

m=5 data:
-> summ age

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         age |        154    56.40651    11.44234   20.73613   83.78423
This can also be useful if the analysis you want to execute is not yet supported by mi estimate.
7.3.1 Rubin’s rules
If you wanted to pool the results yourself, you can obtain the pooled point estimate as the simple average of the estimates across imputations. The formula for the variance is slightly more complicated, so we don't reproduce it here; it can be found in the "Methods and formulas" section of the MI manual (run help mi estimate, then click on "[MI] mi estimate" at the top of the file to open the manual).
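As an illustration, a minimal sketch of pooling a point estimate by hand in the flong data above, averaging the smokes coefficient over the 5 imputations (the pooled variance requires the full formula, which this skips):

* fit the model on each imputed data set and average one coefficient
scalar pooled = 0
forvalues m = 1/5 {
    quietly logit attack smokes age bmi female hsgrad if _mi_m == `m'
    scalar pooled = pooled + _b[smokes]
}
display "Pooled point estimate for smokes: " pooled/5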
7.4 Removing the MI data
Ideally, you should save the data (or preserve it) prior to imputing, so you can easily recover the unimputed data if you wish. If you want to return to the original data, the following should work:
mi unset
drop if mi_m != 0
drop mi_*
The first tells Stata not to treat it as imputed anymore; the second drops all imputed data sets; the third removes the MI variables that were generated.
This only works for mi set flong; if you used another style, you can tweak the above or use mi convert flong to switch to "flong" first.
7.5 Survey and multiple imputation
Just a quick note: if you want to utilize a complex survey design and multiple imputation simultaneously, the commands need to be issued in the proper order.
mi set ...
mi svyset ...
mi impute ... [pweight = weight]
mi estimate: svy: regress ...
Survey weights must be used in the imputations (Reist and Larsen 2012), hence the [pweight = ...] part of the mi impute command; note that of the survey design, only the weights play a role in the imputation step. Estimation commands still need both the mi estimate: and svy: prefixes, in that order.
¹ There is technically Little's MCAR test to compare MCAR vs MAR, but the majority of imputation methods require only MAR, not MCAR, so it's of limited use. Additionally, it is not yet supported in Stata.

² Technically this happens as soon as you run mi set, but they're not interesting until after mi impute.