
Missing Data

This discussion borrows heavily from:

- Cohen, Jacob, and Patricia Cohen. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences (1975 edition; the 2003 edition is also used a little).
- Allison, Paul. Missing Data. Sage Quantitative Applications in the Social Sciences, paper #136, 2002.
- Newman, Daniel A. 2003. "Longitudinal Modeling with Randomly and Systematically Missing Data: A Simulation of Ad Hoc, Maximum Likelihood, and Multiple Imputation Techniques." Organizational Research Methods, Vol. 6 No. 3, July 2003, pp. 328-362.
- Patrick Royston's series of articles in volumes 4 and 5 of The Stata Journal on multiple imputation. See especially Royston, Patrick. 2005. "Multiple Imputation of Missing Values: Update." The Stata Journal, Vol. 5 No. 2, pp. 188-201.
- Stata 11's built-in commands for multiple imputation. If you have Stata 11, the entire MI manual is available as a PDF file.

Often, part or all of the data are missing for a subject. This handout describes the various types of missing data and common methods for handling them. The readings can help you with the more advanced methods.

I. Types of missing data

There are several useful distinctions we can make.

Dependent versus independent variables. Most methods involve missing values for IVs, although in recent years methods for dealing with missing data in the dependent variable have been developed.

Random versus selective loss of data. A researcher must ask why the data are missing. In some cases the loss is completely at random (MCAR), i.e. the absence of values on an IV is unrelated to Y or to other IVs. Unfortunately, in survey research, the loss often is not random. Refusal or inability to respond may be correlated with such things as education, income, interest in the subject, geographic location, etc. Selective loss of data is much more problematic than random loss.

Missing by design; or, not asked or not applicable. These are special cases of random versus selective loss of data.
Sometimes data are missing because the researcher deliberately did not ask the question of that particular respondent. For economic reasons, some questions might be asked of only a random subsample of the entire sample. For example, there is a short version of the census (answered by everyone) and a long version that is answered by only 20%. This can be treated the same as a random loss of data, keeping in mind that the loss may be very high. Other times, skip patterns are used to ask questions only of respondents with particular characteristics. For example, only married individuals might be asked questions about family life. With this selective loss of data, you must keep in mind that the subjects who were asked the questions are probably quite different from those who were not (and that the question may not have been asked of the others because it would not make any sense for them).
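The practical difference between random and selective loss can be seen in a small simulation. This is an illustrative sketch with invented variable names, coefficients, and cutoffs (nothing here comes from the handout's data): under MCAR the observed cases have about the same mean on Y as the full sample, while under selective loss the observed mean is biased.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
educ = rng.normal(12, 3, n)                 # years of education (invented IV)
y = 5 + 2 * educ + rng.normal(0, 4, n)      # invented outcome

# MCAR: every case has the same 30% chance of not answering
mcar_observed = rng.random(n) > 0.30

# Selective loss: low-education respondents are far more likely to skip
p_miss = np.where(educ < 10, 0.70, 0.10)
selective_observed = rng.random(n) > p_miss

print(y.mean())                       # full-sample mean of Y
print(y[mcar_observed].mean())        # close to the full-sample mean
print(y[selective_observed].mean())   # biased upward: low-educ cases drop out
```

Even though 30% of cases are lost in the MCAR condition, the observed mean is an unbiased estimate; under selective loss, no amount of extra sample size fixes the bias.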

Missing DataPage 1

It can be quite frustrating to think you've found the perfect question, only to find that just 3% of your sample answered it! However, keep in mind that, many times, most subjects may actually be answering the same or similar questions, but at different points in the questionnaire. For example, married individuals may answer question 37 while unmarried individuals are asked the same thing in question 54 (perhaps with a slight change of wording to reflect the difference in marital status). Hence, it may be possible to construct a more or less complete set of data by combining responses from several questions. Often, the collectors or distributors of the data have already done this for you.

Many versus few missing data, and their pattern. Is only 1% of the data missing, or 40%? Is there much data missing from a few subjects, or a little data missing from each of several subjects? Is the missing data concentrated on a few IVs, or is it spread across several IVs?

II. Alternatives for handling missing data


We will discuss several different alternatives here. We caution in advance that, while many of these methods have been widely used, some are very problematic and their use is not encouraged (although you should be aware of them in case you encounter them in your reading).

Compare the missing and non-missing cases on variables where information is not missing. Whatever strategy you follow, you may be able to add plausibility to your results (or detect potential biases) by comparing sample members on variables that are not missing. For example, in a panel study, some respondents will not be re-interviewed because they could not be found or refused to participate. You can compare respondents and nonrespondents in terms of demographic characteristics such as race, age, income, etc. If there are noteworthy differences, you can point them out, e.g. lower-income individuals appear to be underrepresented in the sample. Similarly, you can compare individuals who answered a question with those who failed to answer it, to see whether there are any apparent differences between the groups.

Dropping variables. When, for one or a few variables, a substantial proportion of cases lack data, the analyst may simply opt to drop those variables. This is no great loss if the variables had little effect on Y anyway. However, you presumably would not have asked the question if you did not think it was important, so just dropping variables may or may not be a good solution. Still, this is often the best, or at least the most practical, approach. A great deal of missing data for an item might indicate fundamental problems that make it unusable: perhaps the question was poorly worded, or perhaps there were problems with collecting the data.

Dropping subjects, i.e. listwise (also called casewise) deletion of missing data. Particularly if the missing data are limited to a small number of subjects, you may just opt to eliminate those cases from the analysis.
That is, if a subject is missing data on any of the variables used in the analysis, that case is dropped completely. The remaining cases, however, may not be representative of the population. Even if data are missing on a random basis, listwise deletion can result in a substantial reduction in sample size if many cases are missing data on at least one variable. My guess is that listwise deletion is the most common approach for handling missing data, and it often works well, but you should be aware of its limitations if using it.

Another thing to be careful of, when using listwise deletion, is to make sure that your selected samples remain comparable when you are doing a series of analyses. Suppose, for example, you do one regression where the IVs are X1, X2, and X3. You then do a subsequent analysis with those same three variables plus X4. The inclusion of X4 (if it has missing data) could cause the sample size to decline. This could affect your tests of statistical significance. You might, for example, conclude that the effect of X3 becomes insignificant once X4 is controlled for, but this could be very misleading if the change in significance was the result of a decline in sample size rather than of any effect X4 has. Also, as we've noted before, many statistical tests that explicitly compare models assume that the same cases are being analyzed throughout.

Further, if the X4 cases are missing on a nonrandom basis, your understanding of how variable effects are interrelated could also get distorted. For example, suppose X1-X3 are asked of all respondents, but X4 is only asked of women. You might see huge changes in the estimated effects of X1-X3 once X4 was added. This might occur only because the samples analyzed are different; if you had analyzed only women throughout, the effects of X1-X3 might change little once X4 was added.

In SPSS, you might use Select If or Filter commands to consistently limit your data to the cases you want. You can also often avoid this problem by using a single Regression command with variables entered hierarchically, e.g. first you enter X1-X3, then you enter X4. If you use two separate Regression commands, you may run into trouble.


In Stata, you have to be a little more careful. Especially if your subsample selection is a little complicated, it may be best just to temporarily drop the cases you don't want. Some other approaches:
. reg y x1 x2 x3 if !missing(x4)
. reg y x1 x2 x3 x4

The first regression will only get run on those cases which do not have missing values on x4. Another approach:


. gen touse = !missing(y, x1, x2, x3, x4)
. reg y x1 x2 x3 if touse

The variable touse will be coded 1 if there is no missing data on any of the variables specified; otherwise it will equal 0. The if statement on the reg command limits the analysis to cases with nonzero values on touse (i.e. the cases with data on all 5 variables).

The nestreg prefix is another very good approach when you are estimating a series of nested models, e.g. first you estimate the model with x1 x2 x3, then you estimate a model with x1 x2 x3 x4 x5, etc. nestreg does listwise deletion on all the variables, and will also give you incremental F tests showing whether the variables added at each step are statistically significant, e.g.
. nestreg: reg y (x1 x2 x3) (x4 x5)
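Outside Stata, the same bookkeeping idea applies. Here is a hypothetical pandas sketch (invented data and variable names) of why the estimation sample shrinks when a variable with missing data is added, and of a touse-style flag that keeps a series of models on the same complete-case sample:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame(rng.normal(size=(n, 5)), columns=["y", "x1", "x2", "x3", "x4"])
df.loc[rng.random(n) < 0.40, "x4"] = np.nan   # x4 missing for roughly 40% of cases

# Listwise deletion silently shrinks the sample once x4 enters the model
n_small = len(df.dropna(subset=["y", "x1", "x2", "x3"]))
n_full = len(df.dropna(subset=["y", "x1", "x2", "x3", "x4"]))
print(n_small, n_full)   # n_full is roughly 40% smaller than n_small

# A touse-style flag keeps every model in the series on the same sample
touse = df[["y", "x1", "x2", "x3", "x4"]].notna().all(axis=1)
print(touse.sum())       # same count as n_full
```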

The missing-data correlation matrix, i.e. pairwise deletion of missing data. Such a matrix is computed by using, for each pair of variables (Xi, Xj), as many cases as have valid values on both variables. That is, when data are missing on either (or both) of the variables for a subject, that case is excluded from the computation of rij. In general, then, different correlation coefficients are not necessarily based on the same subjects or the same number of subjects.

This procedure is sensible if (and only if) the data are randomly missing. In that case, each correlation, mean, and standard deviation is an unbiased estimate of the corresponding population parameter. The varying Ns on which the regression results are based are somewhat problematic: the results are clearly not as sturdy as if the maximum N had been obtained throughout, nor as frail as the minimum N would suggest. Still, if data are missing randomly, this approach is not too bad.

Unfortunately, the assumption that data are missing randomly is often not valid. If data are not missing at random, several problems develop:

- The pieces put together for the regression analysis refer to systematically different subsets of the population, e.g. the cases used in computing r12 may be very different from the cases used in computing r34. Results cannot be interpreted coherently for the entire population, or even for some discernible subpopulation.
- One can obtain a missing-data correlation matrix whose values are mutually inconsistent, i.e. it would be mathematically impossible to obtain such a matrix from any complete population (e.g. such a matrix might produce a squared multiple correlation of -.3!).
- It may be even worse, though, if you do get a consistent matrix. With an impossible matrix, you'll receive some sort of warning that the results are implausible, but with a consistent matrix the results might seem OK even though they are total nonsense.
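The "impossible matrix" point can be illustrated numerically. The correlations below are invented for illustration: each one is legitimate on its own, but pairwise deletion could assemble them from three different subsets of cases, and no complete data set could produce all three at once. A valid correlation matrix must have all eigenvalues nonnegative; this one does not.

```python
import numpy as np

# Each entry is a legitimate correlation by itself, but r12 = r13 = 0.9
# together force r23 to be strongly positive, not -0.9, in any real data set.
R = np.array([[ 1.0,  0.9,  0.9],
              [ 0.9,  1.0, -0.9],
              [ 0.9, -0.9,  1.0]])

eigenvalues = np.linalg.eigvalsh(R)
print(eigenvalues.min())   # -0.8: a valid correlation matrix has no eigenvalue below 0
```

Feeding such a matrix into a regression routine is what produces nonsense like a negative squared multiple correlation.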
Also, even if data are missing randomly, pairwise deletion is only practical for statistical analyses where a correlation matrix can be analyzed, e.g. OLS regression. It does not work for techniques like logistic regression. Even in the case of OLS, statistical software may or may not make it easy to use pairwise deletion; e.g. it is easy to do in SPSS but requires extra effort if you want to do it in Stata. For these and other reasons, pairwise deletion is not widely used or recommended. I would probably feel most comfortable with it in cases where only a random subset of the sample had been asked some questions while other questions had been answered by everyone, such as in the Census. (Of course, some of the same problems that make pairwise deletion problematic may make other alternatives problematic as well; one of the worst things about pairwise deletion may be that it can cause you to miss the fact that problems are present.)

Nominal variables: treat missing data as just another category. Suppose the variable Religion is coded 1 = Catholic, 2 = Protestant, 3 = Other. Suppose some respondents fail to answer this question. Rather than just exclude these subjects, we could set up a fourth category, 4 = Missing Data (or No Response). We could then proceed as usual, constructing three dummy variables from the four-category religion variable. This method has been popular for years, but according to Allison and others, it produces biased estimates.

Substituted (plugged-in) values, i.e. (single) imputation. A common strategy, particularly if the missing data are not too numerous, is to substitute some sort of plausible guess [imputation] for the missing data. Common choices include:

- The overall mean
- An appropriate subgroup mean (e.g. the mean for blacks or for whites)
- A regression estimate (i.e. for the non-MD cases, regress X on other variables; use the resulting regression equation to compute X when X is missing)

In its documentation for the impute command (see examples below), the Stata 8 reference manual states that [imputation] is not the only method for coping with missing data, but it is often much better than deleting cases with any missing data, which is the default. However, many disagree with this.
Note that these strategies tend to reduce variability, and can artificially increase R and decrease standard errors. Also, remember that cases may be missing precisely because they are atypical; hence plugging in typical values may not be wise. According to Allison, "All of these imputation methods suffer from a fundamental problem: Analyzing imputed data as though it were complete data produces standard errors that are underestimated and test statistics that are overestimated. Conventional analytic techniques simply do not adjust for the fact that the imputation process involves uncertainty about the missing values."

Substituted (plugged-in) value plus missing data indicator. Allison calls this "dummy variable adjustment." As is explained below, this strategy is no longer recommended, except under certain conditions, in spite of what my old course notes and exams may say! This strategy proceeds as follows:

- Plug in some arbitrary value for all MD cases (typically 0, or the variable's mean).
- Include in the regression a dummy variable coded 1 if data on the original variable were missing (i.e. a value has been plugged in), 0 otherwise.
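The two steps above, and the variance-shrinking caveat, can be sketched with invented data (the variable names and distribution are hypothetical, not from the handout):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
x = pd.Series(rng.normal(50, 10, 200))
x[rng.random(200) < 0.25] = np.nan     # knock out roughly 25% of the values

sd_before = x.std()                    # SD among the observed cases only

# Step 1: record which cases were missing, before anything is plugged in
x_miss = x.isna().astype(int)
# Step 2: plug the observed mean into the missing cases
x_filled = x.fillna(x.mean())

sd_after = x_filled.std()
print(sd_before, sd_after)             # mean substitution shrinks the SD
```

In a regression you would then enter both x_filled and x_miss as predictors. As discussed below, Allison's point is that this is defensible only when the missing value genuinely does not exist (e.g. spouse's education when there is no spouse), not when a true value is simply unobserved.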


Although any arbitrary value can be used, it is particularly convenient to use the mean when plugging in values. The t-test of the coefficient for the missing-data dichotomy then (supposedly) indicates whether or not data are missing at random. According to Cohen and Cohen (1975), the advantages of this approach are that you:

- avoid the risk of non-representativeness from dropping subjects if data are missing nonrandomly
- avoid the loss of statistical power due to reduced N even if data are missing randomly
- capitalize on the information inherent in the absence or presence of values on the variable in question (e.g. are missing cases different from non-missing ones?)
- capitalize on the information present on other variables even when it is missing for some subjects on the variable in question

HOWEVER, while this technique has been used for many years (including in this class), Allison and others have recently been critical of it. Allison calls this technique "remarkably simple and intuitively appealing. But unfortunately, the method generally produces biased estimates of the coefficients." See his book for examples. In the 2003 edition of their book, Cohen and Cohen no longer advocate missing data dummies, and acknowledge that they have not been widely used.

NOTE!!! Buried in footnote 5 of Allison's book is a very important point that is often overlooked (thanks to Richard Campbell from Illinois-Chicago for pointing this out to me): while the dummy variable adjustment method is clearly unacceptable when data are truly missing, it may still be appropriate in cases where the unobserved value simply does not exist. For example, married respondents may be asked to rate the quality of their marriage, but that question has no meaning for unmarried respondents. Suppose we assume that there is one linear equation for married couples and another equation for unmarried couples.
The married equation is identical to the unmarried equation except that it has (a) a term corresponding to the effect of marital quality on the dependent variable and (b) a different intercept. It's easy to show that the dummy variable adjustment method produces optimal estimates in this situation. So, for example, you might have questions about mother's education and father's education, but the father is unknown or was never part of the family. Or, you might have spouse's education, but there is no spouse. In such situations, the dummy variable adjustment method may be appropriate. Conversely, if there is a true value for father's education but it is missing, Allison says the dummy variable adjustment method should not be used.

Advanced methods: Maximum Likelihood Estimation and Multiple Imputation. Allison concludes that, of the conventional methods listed above, listwise deletion often works best. However, he argues that, under certain conditions, Maximum Likelihood methods and Multiple Imputation methods can work better. As Newman (2003, p. 334) notes, "MI [multiple imputation] is a procedure by which missing data are imputed several times (e.g. using regression imputation) to produce several different complete-data estimates of the parameters. The parameter estimates from each imputation are then combined to give an overall estimate of the complete-data parameters as well as reasonable estimates of the standard errors." Maximum Likelihood (ML) approaches operate by estimating a set of parameters that maximize the probability of getting the data that were observed (Newman, p. 332).

I highly recommend reading pages 1-13 of the Stata 11 Multiple Imputation Manual for more on the theory behind multiple imputation. For users of earlier versions of Stata, check out the ice and mim commands and their documentation. I will later give a simple example using Stata 11's built-in commands, but you should do more reading if you want to do your own MI analysis.

III. Simple Example

Here is an example from a previous exam. A researcher collected the following data:

Case #    Y     X1        X2        X3
   1      30    2         Missing   12
   2      37    2         1         Missing
   3      41    3         1         20
   4      42    1         Missing   16
   5      45    3         2         Missing
   6      49    1         2         27
   7      51    Missing   1         30
   8      55    3         2         33
   9      58    Missing   2         19
  10      60    2         Missing   24

a. Suppose the researcher believes that data are missing on a random basis, i.e. those who did not respond are no different from those who did. What would you recommend for her: pairwise deletion of missing data, or listwise deletion? Why?

Listwise deletion would result in 70% of the cases being deleted. Because data are missing randomly and the MD is spread across several variables, pairwise deletion or multiple imputation might be a reasonable option in this case. Still, I would probably do further examination to find out why so many cases were missing some data, i.e. I would want to be confident that the data really are missing randomly. This might occur in situations where, say, only random subsamples are asked some questions, such as in the short and long forms of the Census questionnaire.
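The 70% figure can be checked directly. This is a quick pandas sketch that re-enters the exam table above and counts the complete cases:

```python
import numpy as np
import pandas as pd

NA = np.nan
df = pd.DataFrame({
    "y":  [30, 37, 41, 42, 45, 49, 51, 55, 58, 60],
    "x1": [2, 2, 3, 1, 3, 1, NA, 3, NA, 2],
    "x2": [NA, 1, 1, NA, 2, 2, 1, 2, 2, NA],
    "x3": [12, NA, 20, 16, NA, 27, 30, 33, 19, 24],
})

complete = df.dropna()                   # cases 3, 6, and 8 are the only survivors
print(len(complete))                     # 3
print(1 - len(complete) / len(df))       # 0.7 of the sample would be dropped
```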


b. Suppose the researcher believes that data may be missing on a non-random basis. What would you recommend for her: substitution of the mean for MD cases, or substitution of the mean plus missing data dichotomies? Why?

In the past I recommended the Cohen and Cohen method: substitute the mean for the MD cases, and then add a missing data dichotomy. A significant coefficient for the dichotomy supposedly indicated that data were missing on a non-random basis. That method has now been discredited, however. The researcher probably needs to better understand the reasons data are missing before deciding on a strategy. For example, are data missing because the question was not appropriate for the respondent (e.g. questions about marital satisfaction should not be asked of people who are not married)? Are they missing because some subjects refused to talk about sensitive topics? Were there problems with the questionnaire or with the data collection? However, if the data are missing because they are non-existent (e.g. the question pertains to the spouse but there is no spouse), the dummy variable adjustment method may be appropriate.

IV. Using Stata 11 to handle missing data: Multiple Imputation

This example is adapted from pages 1-13 of the Stata 11 Multiple Imputation Manual and also quotes directly from the Stata 11 online help. If you have Stata 11 the entire manual is available as a PDF file. This is a simple example and there are other commands and different ways to do multiple imputation, so you should do a lot more reading if you want to use MI yourself. The file mheart0.dta is a fictional data set with 154 cases, 22 of which are missing data on bmi (Body Mass Index). The dependent variable for this example is attack, coded 0 if the subject did not have a heart attack and 1 if he or she did.
. webuse mheart0, clear
(Fictional heart attack data; bmi missing)

. sum

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      attack |       154    .4480519    .4989166          0          1
      smokes |       154    .4155844    .4944304          0          1
         age |       154    56.48829    11.73051   20.73613   87.14446
         bmi |       132    25.24136    4.027137   17.22643   38.24214
      female |       154    .2467532    .4325285          0          1
-------------+--------------------------------------------------------
      hsgrad |       154    .7532468    .4325285          0          1
   marstatus |       154    1.941558    .8183916          1          3
     alcohol |       154    1.181818    .6309506          0          2
     hightar |       154    .2077922     .407051          0          1

. mi set mlong

[From the Stata 11 online help:] mi set is used to set a regular Stata dataset to be an mi dataset. An mi set dataset has the following attributes:

- The data are recorded in a style: wide, mlong, flong, or flongsep.
- Variables are registered as imputed, passive, or regular, or they are left unregistered.
Missing DataPage 8

- In addition to m=0, the original data with missing values, the data include M>=0 imputations of the imputed variables.

For this example, the Stata 11 Manual says we choose to use the data in the marginal long style (mlong) because it is a memory-efficient style. Type help mi styles for more details.
. mi register imputed bmi
(22 m=0 obs. now marked as incomplete)

. mi register regular attack smokes age hsgrad female

An imputed variable is a variable that has missing values and for which you have, or will have, imputations. All variables whose missing values are to be filled in must be registered as imputed variables.

A passive variable (not used in this example) is a variable that is a function of imputed variables (e.g. an interaction effect) or of other passive variables. A passive variable will have missing values in m=0 (the original data set) and varying values for observations in m>0 (the imputed data sets).

A regular variable is a variable that is neither imputed nor passive and that has the same values, whether missing or not, in all m; registering regular variables is optional but recommended. In the above, we are telling Stata that the values of bmi will be imputed while the values of the other variables will not be.
. mi impute regress bmi attack smokes age hsgrad female, add(20) rseed(2232)

Univariate imputation                       Imputations =       20
Linear regression                                 added =       20
Imputed: m=1 through m=20                       updated =        0

               |           Observations per m
               |----------------------------------------------
      Variable |  complete  incomplete   imputed  |      total
---------------+-----------------------------------+----------
           bmi |       132          22        22  |        154
--------------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
of the number of filled-in observations.)

The mi impute command fills in missing values (.) of a single variable or of multiple variables using the specified method. In this case, the use of regress means use linear regression for a continuous variable; i.e. bmi is being regressed on attack, smokes, age, hsgrad, and female. The Stata 11 manual includes guidelines for choosing variables to include in the imputation model. Other methods include logit, ologit, and mlogit; e.g. you would use logit if you had a binary variable you wanted to impute values for.

The add option specifies the number of imputations, in this case 20. (Stata recommends using at least 20, although it is not unusual to see as few as 5.) The rseed option sets the random number seed, which makes results reproducible (different seeds will produce different imputed data sets).

Case 8 is the first case with missing data on bmi, so let's see what happens to it after imputation:
. list bmi attack smokes age hsgrad female _mi_id _mi_miss _mi_m if _mi_id ==8

     +--------------------------------------------------------------------------------------+
     |      bmi   attack   smokes        age   hsgrad   female   _mi_id   _mi_miss   _mi_m |
     |--------------------------------------------------------------------------------------|
  8. |        .        0        0   60.35888        0        0        8          1       0 |
155. | 18.63577        0        0   60.35888        0        0        8          .       1 |
177. |  22.7805        0        0   60.35888        0        0        8          .       2 |
199. | 24.53927        0        0   60.35888        0        0        8          .       3 |
221. | 27.39019        0        0   60.35888        0        0        8          .       4 |
243. | 28.23859        0        0   60.35888        0        0        8          .       5 |
265. | 19.54897        0        0   60.35888        0        0        8          .       6 |
287. | 33.65324        0        0   60.35888        0        0        8          .       7 |
309. | 24.02541        0        0   60.35888        0        0        8          .       8 |
331. | 24.24221        0        0   60.35888        0        0        8          .       9 |
353. | 28.78618        0        0   60.35888        0        0        8          .      10 |
375. | 29.01385        0        0   60.35888        0        0        8          .      11 |
397. | 25.77463        0        0   60.35888        0        0        8          .      12 |
419. | 14.17755        0        0   60.35888        0        0        8          .      13 |
441. | 22.73674        0        0   60.35888        0        0        8          .      14 |
463. | 21.77571        0        0   60.35888        0        0        8          .      15 |
485. | 21.49317        0        0   60.35888        0        0        8          .      16 |
507. | 27.54434        0        0   60.35888        0        0        8          .      17 |
529. | 29.19841        0        0   60.35888        0        0        8          .      18 |
551. | 14.83504        0        0   60.35888        0        0        8          .      19 |
573. | 24.38487        0        0   60.35888        0        0        8          .      20 |
     +--------------------------------------------------------------------------------------+

bmi is missing in the original unimputed data set (_mi_m = 0). For each of the 20 imputed data sets, a different value has been imputed for bmi. The imputation of multiple plausible values will let the estimation procedure take into account the fact that the true value is unknown and hence uncertain. The Stata 11 Manual recommends checking to see whether the imputations appear reasonable. In this case we do so by running the mi xeq command, which executes command(s) on individual imputations. Specifically, we run the summarize command on the original data set (m = 0) and on the (arbitrarily chosen) first and last imputed data sets. The means and standard deviations for bmi are all similar and seem reasonable in this case:
. mi xeq 0 1 20: summarize bmi

m=0 data:
-> summarize bmi

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         bmi |       132    25.24136    4.027137   17.22643   38.24214

m=1 data:
-> summarize bmi

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         bmi |       154     25.0776    4.016672   17.22643   38.24214

m=20 data:
-> summarize bmi

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         bmi |       154    25.41818    4.114603   14.71254   38.24214
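Conceptually, each of the 20 imputations was generated by something like the following stochastic regression imputation, sketched here in Python with invented data and a single predictor. This is a simplification: Stata's proper imputation also draws the regression coefficients and residual variance from their posterior distribution so each imputation uses slightly different parameters, a refinement this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
age = rng.normal(55, 10, n)
bmi = 18 + 0.12 * age + rng.normal(0, 3, n)   # invented relationship
miss = rng.random(n) < 0.15                   # roughly 15% of bmi set to missing
bmi_obs = np.where(miss, np.nan, bmi)

# Fit OLS of bmi on age using the complete cases only
obs = ~miss
X_obs = np.column_stack([np.ones(obs.sum()), age[obs]])
beta, *_ = np.linalg.lstsq(X_obs, bmi_obs[obs], rcond=None)
resid_sd = np.std(bmi_obs[obs] - X_obs @ beta, ddof=2)

# One imputation: prediction plus random noise, so imputed values keep a
# realistic scatter instead of all sitting exactly on the regression line
X_mis = np.column_stack([np.ones(miss.sum()), age[miss]])
bmi_complete = bmi_obs.copy()
bmi_complete[miss] = X_mis @ beta + rng.normal(0, resid_sd, miss.sum())
```

Repeating the final two lines with fresh noise draws would produce the multiple plausible values seen in the listing above.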

The mi estimate command does estimation using multiple imputations. The desired analysis is done on each imputed data set, and the results are then combined into a single multiple-imputation result. (The dots option just tells Stata to print a dot after each estimation; it helps you track progress, and an X gets printed if there is a problem doing one of the estimations.)
. mi estimate, dots: logit attack smokes age bmi hsgrad female
Imputations (20): .........10.........20 done

Multiple-imputation estimates                   Imputations     =         20
Logistic regression                             Number of obs   =        154
                                                Average RVI     =     0.0312
DF adjustment:   Large sample                   DF:     min     =    1060.38
                                                        avg     =  223362.56
                                                        max     =  493335.88
Model F test:       Equal FMI                   F(   5,71379.3) =       3.59
Within VCE type:          OIM                   Prob > F        =     0.0030

------------------------------------------------------------------------------
      attack |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      smokes |   1.198595   .3578195     3.35   0.001     .4972789    1.899911
         age |   .0360159   .0154399     2.33   0.020     .0057541    .0662776
         bmi |   .1039416   .0476136     2.18   0.029      .010514    .1973692
      hsgrad |   .1578992   .4049257     0.39   0.697    -.6357464    .9515449
      female |  -.1067433   .4164735    -0.26   0.798    -.9230191    .7095326
       _cons |  -5.478143   1.685075    -3.25   0.001    -8.782394   -2.173892
------------------------------------------------------------------------------
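The combining step that mi estimate performs follows what are known as Rubin's rules. Here is a minimal numpy sketch of the pooling arithmetic, using made-up estimates (not the ones from the output above): the point estimates are averaged, and the pooled variance adds the average within-imputation variance W to the between-imputation variance B, inflated by (1 + 1/M).

```python
import numpy as np

# Hypothetical: one coefficient estimated in each of M = 5 imputed data sets,
# with its squared standard error (the within-imputation variance)
estimates = np.array([0.52, 0.47, 0.55, 0.50, 0.46])
variances = np.array([0.010, 0.011, 0.009, 0.010, 0.012])
M = len(estimates)

pooled = estimates.mean()              # combined point estimate
W = variances.mean()                   # average within-imputation variance
B = estimates.var(ddof=1)              # between-imputation variance
T = W + (1 + 1 / M) * B                # total variance, per Rubin's rules
pooled_se = np.sqrt(T)
print(pooled, pooled_se)
```

Because T is always larger than W alone, the pooled standard error properly reflects the extra uncertainty due to imputation, which is exactly what single imputation fails to do.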

Note that you don't always get the same information as you do with non-imputed data sets (e.g. Pseudo R2), partly because these statistics don't always make sense with imputed data, or because it is not clear how to compute them. Compare this to the results when we analyze only the original, unimputed data:
. mi xeq 0: logit attack smokes age bmi hsgrad female, nolog

m=0 data:
-> logit attack smokes age bmi hsgrad female, nolog

Logistic regression                             Number of obs   =        132
                                                LR chi2(5)      =      24.03
                                                Prob > chi2     =     0.0002
Log likelihood = -79.34221                      Pseudo R2       =     0.1315

------------------------------------------------------------------------------
      attack |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      smokes |   1.544053   .3998329     3.86   0.000     .7603945    2.327711
         age |    .026112    .017042     1.53   0.125    -.0072898    .0595137
         bmi |   .1129938   .0500061     2.26   0.024     .0149837     .211004
      hsgrad |   .4048251   .4446019     0.91   0.363    -.4665786    1.276229
      female |   .2255301   .4527558     0.50   0.618    -.6618549    1.112915
       _cons |  -5.408398   1.810603    -2.99   0.003    -8.957115    -1.85968
------------------------------------------------------------------------------

The most striking difference is that the effect of age is statistically significant in the imputed data, whereas it wasn't in the original data set.

Already existing MI data sets. If you are lucky, somebody else may have already done the imputation for you; and if you are super-lucky, the MI data will already be in Stata format. If not, you'll have to convert it to Stata format yourself. The mi import command may be useful for this purpose. Also, once the data are in Stata format, the mi describe command can be used to provide a detailed report. Using the above data,
. mi describe

  Style:  mlong

  Obs.:   complete          132
          incomplete         22  (M = 20 imputations)
          ---------------------
          total             154

  Vars.:  imputed:  1; bmi(22)
          passive:  0
          regular:  5; attack smokes age hsgrad female
          system:   3; _mi_m _mi_id _mi_miss


(there are 3 unregistered variables; marstatus alcohol hightar)

Other comments:

- Imputation is pretty easy when only one variable has missing data. It can get more complicated in the more typical case when several variables have missing data.
- User-written programs like ice and mim can also be used for imputation and estimation. Some people like the imputation procedures in ice better, at least under some conditions.
- Passive imputation is somewhat controversial. With passive imputation, you would, for example, impute values for x1 and x2, and then multiply those values together to create the interaction term x1x2. The alternative is to multiply x1 * x2 before imputation, and then impute values for the resulting x1x2 interaction term. Perhaps surprisingly, some people (including Paul Allison) claim that the latter approach is superior. The issue was discussed on Statalist in February 2009. If interested, see
  http://www.stata.com/statalist/archive/2009-02/msg00602.html
  http://www.stata.com/statalist/archive/2009-02/msg00613.html


Appendix: Using Stata & SPSS for traditional missing data methods
[NOTE: In the interest of time, I may just let you go over most of sections V & VI on your own.]

V. Using SPSS to handle missing data: traditional methods

First, a caution: While I am going to show you how to implement various methods using SPSS and Stata, in most cases you may be as well off or better off just using listwise deletion or dropping highly problematic variables that have a lot of MD. If you feel your missing data problems are extremely severe, you should consider using more advanced techniques than what we discuss here. In some cases, other software, such as LISREL, may better meet your needs.

A second caution: When using SPSS, Stata, or any program, be careful about permanently overwriting your original data. If you are going to plug in values for missing data, you may want to first create a copy of the original variable and then work on it.

A third caution: If you are only analyzing a subsample of the data (e.g. women only) you want to be careful that your plugged-in values are not computed from the entire sample. In either SPSS or Stata, you may want to create an extract with only the cases you want first, or otherwise control the sample selection that is being used. In general, when manipulating your data, run checks to make sure things are coming out the way you wanted them to!

SPSS has an added-cost routine specifically designed to examine missing data. I haven't seen it, but it sounds interesting. Using more traditional SPSS features:

To assign the mean value to a variable: First, determine the mean of the variable, e.g. have something like

DESCRIPTIVES VARIABLES=VAR01

Then, do something like

IF (MISSING(VAR01)) VAR01 = 31.

(where 31 stands in for whatever mean DESCRIPTIVES reported).
To assign a subgroup mean: First, determine the subgroup mean, e.g. have something like

MEANS TABLES=VAR01 BY RACE

Then, do something like

IF (RACE = 1 AND MISSING(VAR01)) VAR01 = 29. IF (RACE = 2 AND MISSING(VAR01)) VAR01 = 33.
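The same logic in Python, for illustration only, using made-up numbers chosen so the subgroup means come out to the 29 and 33 plugged in above:

```python
from statistics import mean

# Toy records: (race, var01); None = missing. Values are hypothetical.
data = [(1, 28.0), (1, 30.0), (1, None), (2, 31.0), (2, 35.0), (2, None)]

# Subgroup means over the observed values
# (what MEANS TABLES=VAR01 BY RACE reports)
groups = {r for r, _ in data}
gmean = {r: mean(v for rr, v in data if rr == r and v is not None)
         for r in groups}

# The IF statements above, in loop form:
# plug in the mean of the case's own subgroup
filled = [(r, gmean[r] if v is None else v) for r, v in data]
print(gmean)    # -> {1: 29.0, 2: 33.0}
print(filled)
```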

To do mean substitution and create an MD indicator: Determine the mean


Then, do something like

COMPUTE VAR01MD = MISSING(VAR01).
IF (MISSING(VAR01)) VAR01 = 31.

(The MISSING function returns 1 if the case is missing on VAR01 and 0 otherwise, so VAR01MD becomes the missing-data indicator; 31 again stands in for the mean you computed. Note that the indicator must be computed before the mean is plugged in.)
To substitute a regression estimate for the mean: Run a regression where your IV is the dependent variable. Then, using the beta coefficients, do something like
IF (MISSING(VAR01)) VAR01 = 2*X1 + 3*X2 + 7.

To control whether SPSS Regression uses listwise, pairwise, or mean substitution: On the regression card, use the MISSING subcommand. Here is the SPSS documentation.
MISSING Subcommand
MISSING controls the treatment of cases with missing values. By default, a case that has a user-missing or system-missing value for any variable named or implied on VARIABLES is omitted from the computation of the correlation matrix on which all analyses are based. The minimum specification is a keyword specifying a missing-value treatment.

LISTWISE  Delete cases with missing values listwise. Only cases with valid values for all variables named on the current VARIABLES subcommand are used. If INCLUDE is also specified, only cases with system-missing values are deleted listwise. LISTWISE is the default if the MISSING subcommand is omitted.

PAIRWISE  Delete cases with missing values pairwise. Each correlation coefficient is computed using cases with complete data for the pair of variables correlated. If INCLUDE is also specified, only cases with system-missing values are deleted pairwise.

MEANSUBSTITUTION  Replace missing values with the variable mean. All cases are included and the substitutions are treated as valid observations. If INCLUDE is also specified, user-missing values are treated as valid and are included in the computation of the means.

INCLUDE  Includes cases with user-missing values. All user-missing values are treated as valid values. This keyword can be specified along with the methods LISTWISE, PAIRWISE, or MEANSUBSTITUTION.

Example

REGRESSION VARIABLES=POP15,POP75,INCOME,GROWTH,SAVINGS
 /DEPENDENT=SAVINGS
 /METHOD=STEP


 /MISSING=MEANSUBSTITUTION.

System-missing and user-missing values are replaced with the means of the variables when the correlation matrix is calculated.


VI. Using Stata to handle missing data: traditional methods

Again, I preface my comments by saying that you generally don't want to use most of these methods! As far as traditional methods go, listwise deletion tends to work as well as or better than anything else. Some things are easier to do in Stata than in SPSS. While there are many ways to compute new variables with corrections for missing data, I find that the impute command is very handy. The basic syntax for impute is
impute depvar varlist [weight] [if exp] [in range], generate(newvar1)

The generate parameter creates a variable called newvar1 (you can call it whatever you want). If the original variable (depvar) is not missing, newvar1 = the original value. If depvar is missing, newvar1 is set equal to a regression estimate computed using the vars in varlist. That is, depvar is regressed on varlist. If some of the vars in varlist themselves have missing data, the regression estimate will be based only on the nonmissing variables. If depvar and all the vars in varlist are missing, newvar1 will also be missing, otherwise it will have a value. First, here are some summary statistics for the data set I am using. As you can see, 95 cases are missing on educ, and the rest have complete data.
. use http://www.nd.edu/~rwilliam/stats2/statafiles/md.dta, clear

. sum

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      income |       500       27.79    8.973491          5       48.3
        educ |       405    13.01728    3.974821          2         21
      jobexp |       500       13.52    5.061703          1         21
       black |       500          .2    .4004006          0          1
       other |       500          .1    .3003005          0          1
-------------+--------------------------------------------------------
       white |       500          .7    .4587165          0          1
        race |       500         1.4    .6639893          1          3


To assign the mean value to a variable: You could also do it pretty much the same way as you do with SPSS:
. gen xeduc = educ
(95 missing values generated)

. replace xeduc = 13.01728 if missing(educ)
(95 real changes made)

. sum educ xeduc

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        educ |       405    13.01728    3.974821          2         21
       xeduc |       500    13.01728    3.576498          2         21
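The pattern in the sum output above (same mean, smaller standard deviation, because the plugged-in values have no variability) is easy to verify in a few lines of Python with made-up numbers:

```python
from statistics import mean, stdev

# Toy variable: None = missing (hypothetical values)
educ = [8.0, 10.0, 12.0, None, 16.0, None, 20.0]

observed = [v for v in educ if v is not None]
m = mean(observed)
filled = [m if v is None else v for v in educ]   # mean substitution

print(mean(observed), mean(filled))    # means agree
print(stdev(observed), stdev(filled))  # SD shrinks after substitution
```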

Here is how we can do it with the impute command:

. gen one = 1

. impute educ one, gen(xeduc1)
19.00% (95) observations imputed

. sum educ xeduc1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        educ |       405    13.01728    3.974821          2         21
      xeduc1 |       500    13.01728    3.576498          2         21

In this case, educ is regressed only on a constant, yielding a predicted value equal to the mean of educ. Hence, xeduc1 = educ when educ is not missing, and xeduc1 = the mean of educ when educ is missing. In other words, the 95 missing cases all got assigned a value of 13.01728 on xeduc1. As you see, however you do it, educ and xeduc1 have the same mean, but xeduc1 has no missing cases. The standard deviation declines because there is no variability in the plugged-in values.

To assign a subgroup mean: The tab command can show us what the subgroup means are, and we can fill in the subgroup means from there. This will work, but it is more tedious and probably more error-prone than using impute:
. tab race, sum(educ)

            |        Summary of educ
       race |        Mean   Std. Dev.       Freq.
------------+------------------------------------
          1 |   14.072202   3.5997967         277
          2 |   9.9302326   4.3865857          86
          3 |   12.380952   .79487324          42
------------+------------------------------------
      Total |   13.017284   3.9748214         405

. gen xeduc2 = educ
(95 missing values generated)

. replace xeduc2 = 14.072202 if race==1 & missing(xeduc2)
(73 real changes made)


. replace xeduc2 = 9.9302326 if race==2 & missing(xeduc2)
(14 real changes made)

. replace xeduc2 = 12.380952 if race==3 & missing(xeduc2)
(8 real changes made)

. tab race, sum(xeduc2)

            |       Summary of xeduc2
       race |        Mean   Std. Dev.       Freq.
------------+------------------------------------
          1 |   14.072202   3.2012515         350
          2 |   9.9302326   4.0646063         100
          3 |   12.380952   .72709601          50
------------+------------------------------------
      Total |   13.074683   3.6365787         500

Using the impute command is simpler and safer:

. impute educ black white other, gen(xeduc2)
19.00% (95) observations imputed

. tab race, sum(xeduc2)

            |    Summary of imputed educ
       race |        Mean   Std. Dev.       Freq.
------------+------------------------------------
          1 |   14.072202   3.2012515         350
          2 |   9.9302326   4.0646063         100
          3 |   12.380952   .72709601          50
------------+------------------------------------
      Total |   13.074683   3.6365787         500

As you see, with either approach, the subgroup means are identical to before, but there are no missing cases. Each missing case had the mean value for its racial subgroup plugged in. Incidentally, if you didn't already have dichotomies computed for race, and just had a multi-category race var, the xi command could compute the dummies for you:
. xi: impute educ i.race, gen(xeduc2b)
i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)
19.00% (95) observations imputed

. tab race, sum(xeduc2b)

            |    Summary of imputed educ
       race |        Mean   Std. Dev.       Freq.
------------+------------------------------------
          1 |   14.072202   3.2012515         350
          2 |   9.9302326   4.0646063         100
          3 |   12.380952   .72709601          50
------------+------------------------------------
      Total |   13.074683   3.6365787         500

To do mean substitution and create an MD indicator: After using one of the above methods to substitute the mean, you could do something like this:
. gen md = 0

. replace md = 1 if xeduc!=educ


If the original variable does not equal the imputed variable, that means a value was plugged in for missing cases. In such cases, md = 1. If educ does equal xeduc, then no value was plugged in, and md = 0. Again, if the data are missing because they are non-existent, rather than missing because values exist but are unknown, this could be a good method.
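The indicator logic, sketched in Python with toy values (None = missing), mirroring the gen/replace pair above:

```python
from statistics import mean

educ = [12.0, None, 16.0, None, 10.0]           # toy data, None = missing

m = mean(v for v in educ if v is not None)
xeduc = [m if v is None else v for v in educ]   # mean-substituted copy

# md = 1 where a value was plugged in, 0 otherwise -- exactly the cases
# where xeduc differs from the original educ
md = [0 if v is not None else 1 for v in educ]

print(md)   # -> [0, 1, 0, 1, 0]
```

Including md as a regressor alongside xeduc then lets "was missing" carry its own coefficient.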

To substitute a regression estimate for the mean: Just specify whatever vars you want to base your regression estimate on (just be careful not to use the Y variable):
. impute educ jobexp black other white, gen(xeduc4)
19.00% (95) observations imputed

. sum xed*

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      xeduc1 |       500    13.01728    3.576498          2         21
      xeduc2 |       500    13.07468    3.636579          2         21
     xeduc2b |       500    13.07468    3.636579          2         21
      xeduc3 |       500    13.01728    3.576498          2         21
      xeduc4 |       500    13.08214    3.659779          2         21

. tab race, sum(xeduc4)

            |    Summary of imputed educ
       race |        Mean   Std. Dev.       Freq.
------------+------------------------------------
          1 |   14.071926   3.2328681         350
          2 |   9.9475566   4.0879436         100
          3 |   12.422792    .8384331          50
------------+------------------------------------
      Total |   13.082139   3.6597795         500

In this particular example, jobexp is not that strongly related to education, hence including it as one of the predictors of education did not have much of an effect on the estimated values over and above what we got when we just used the subgroup differences in the means.

Again, keep in mind that the impute command will form a regression estimate based on all the nonmissing variables in varlist. So, for example, if a case was missing on both educ and jobexp, the imputation would be based on the regression of educ on black, other and white for all cases that were not missing on those variables. Unless a case is missing data on all the variables specified, impute will give a non-missing value for the imputed variable.

To control whether Stata Regression uses listwise, pairwise, or mean substitution: Stata uses listwise deletion. As far as I know, there is no straightforward way to use pairwise deletion (if you desperately wanted it, I suppose you could compute the pairwise correlations and then use the corr2data command to create a data set with the desired correlations). If you want to do mean substitution, you'd have to compute the vars yourself, using methods like those described above.
