Connect Fall 1998  Statistics and Social Sciences


New SPSS Missing Values Analysis Option

Frank LoPresti

Early in my career as a data analyst, I was taught that a research project using questionnaires could easily spend 90 percent of its budget collecting and preparing data. To end up with a good data set, a substantial effort should go to the investigation and resolution of missing data.

It is the exceptional study that has no missing data. Missing data occurs for many reasons. Questionnaire respondents often feel uncomfortable answering certain questions, such as those dealing with age, income, sexual behavior or religious beliefs. Also, respondents accidentally skip items, sections or even entire pages of a questionnaire. Accidental omissions are usually made randomly and may not have a serious effect on the outcome. But sometimes, if the omissions are caused by poor questionnaire design or collection, they follow a pattern.

For example, research staff may be hesitant to enter neighborhoods they consider dangerous or out of the way, so a pattern of missing data can develop unintentionally. Or language difficulties might lead to incomplete or incorrect responses, in which case the pattern would be related to literacy.

Answers to questions about income and age are frequently missing in a non-random pattern. The distribution of missing ages would probably be skewed towards older ages. Income tends to be more heavily missing at both the lower and higher ends of the scale.

Missing items in a scale are a separate area of concern. Certainly, missing items from a list of vegetable preferences can be treated differently than items left unanswered in a questionnaire section dealing with a range of sexual behavior. When a person answers the first ten of 20 questions about green vegetables, it would probably be valid to use an average of the answered questions to calculate his or her "Green Vegetable" scale. This is not so with questions left unanswered in the sexual behavior section. The missing data may be linked to religion or gender. Respondents who don't answer all the questions often show a pattern in their skipped questions. Their missing responses merit investigation.
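The "Green Vegetable" case can be sketched in a few lines of Python. This is only an illustration: the function name scale_score, the min_answered threshold and the sample responses are assumptions made for the example, not part of SPSS or of any study.

```python
from statistics import mean

def scale_score(items, min_answered=1):
    """Average the answered items of a scale; None marks a skipped item.

    Returns None when fewer than min_answered items were answered.
    """
    answered = [x for x in items if x is not None]
    if len(answered) < min_answered:
        return None
    return mean(answered)

# Hypothetical respondent who answered the first 10 of 20 items (1-5 ratings).
responses = [4, 5, 3, 4, 4, 2, 5, 4, 3, 4] + [None] * 10
green_vegetable = scale_score(responses, min_answered=10)
print(green_vegetable)
```

The threshold makes the analyst's judgment explicit: a scale where skipped items are probably harmless can tolerate a low min_answered, while a sensitive scale should not be averaged this way without first investigating who skipped what.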

Missing data is so central to creating a useful data set that statistical packages, like SPSS and SAS, allow the researcher to code missing responses for further study and for special treatment in the statistical procedures. SPSS has a "System Missing" response that shows up in its spreadsheet as a period.

SPSS also allows the researcher to code several other values as missing in order to keep track of the specific reason the data was missing. For example, a valid value of the variable "Spouse's Age" could be coded as the actual value. That is, a respondent with a 22-year-old spouse would be assigned the value "22" for Spouse's Age. The number "98" could be used if the person refused to answer, and "99" used if the question was not applicable, such as if the person had no spouse.

Ask SPSS to prepare a frequency table of spouses' ages, and the table will give frequencies and percentages for the entire sample, and again for the sample without the missing values.
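To make the two sets of percentages concrete, here is a rough Python sketch of such a frequency table. It is not SPSS output; the ages are invented, and the codes 98 and 99 simply follow the Spouse's Age example above.

```python
from collections import Counter

# Codes declared as missing, following the Spouse's Age example.
MISSING_CODES = {98: "Refused", 99: "Not applicable"}

def frequency_table(values, missing_codes):
    """Return rows of (value, count, percent, valid_percent).

    percent uses the whole sample; valid_percent uses only the
    non-missing cases and is None for the missing codes themselves.
    """
    counts = Counter(values)
    total = len(values)
    n_valid = sum(n for v, n in counts.items() if v not in missing_codes)
    rows = []
    for value in sorted(counts):
        n = counts[value]
        percent = 100.0 * n / total
        valid_percent = None if value in missing_codes else 100.0 * n / n_valid
        rows.append((value, n, percent, valid_percent))
    return rows

# Invented sample of ten respondents.
spouse_age = [22, 22, 35, 41, 98, 99, 99, 35, 22, 98]
table = frequency_table(spouse_age, MISSING_CODES)
for value, n, pct, valid_pct in table:
    label = MISSING_CODES.get(value, str(value))
    valid = "" if valid_pct is None else f"{valid_pct:6.1f}"
    print(f"{label:>15} {n:3d} {pct:6.1f} {valid}")
```

In this invented sample, age 22 accounts for 30 percent of all ten respondents but 50 percent of the six who gave a valid age, which is exactly the difference the two columns are meant to show.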

The missing data problem can be illustrated with a data set of just three variables. If some people in a study answer only two of the three questions asked, what number should we use as our sample size? Should we throw out any incomplete responses? Should we use different numbers in different sections of our report? Who did not answer each of the three questions? Are there non-random patterns of missing data? Missing data can make a mess of analysis.
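A small Python sketch shows how the sample size shifts with the rule we adopt. The five cases and the names q1, q2 and q3 are invented for illustration; the terms "listwise" and "pairwise" are the usual names for dropping incomplete cases entirely versus using whatever cases are available for each pair of variables.

```python
from itertools import combinations

# Invented answers to three questions; None marks a missing response.
cases = [
    {"q1": 3, "q2": 5, "q3": 1},
    {"q1": 2, "q2": None, "q3": 4},
    {"q1": None, "q2": 4, "q3": 2},
    {"q1": 5, "q2": 1, "q3": None},
    {"q1": 4, "q2": 2, "q3": 3},
]
variables = ["q1", "q2", "q3"]

# Sample size for each question taken on its own.
n_per_variable = {v: sum(c[v] is not None for c in cases) for v in variables}

# Listwise: keep only cases that answered every question.
n_listwise = sum(all(c[v] is not None for v in variables) for c in cases)

# Pairwise: for each pair of questions, count cases that answered both.
n_pairwise = {
    (a, b): sum(c[a] is not None and c[b] is not None for c in cases)
    for a, b in combinations(variables, 2)
}

print("per variable:", n_per_variable)
print("listwise:", n_listwise)
print("pairwise:", n_pairwise)
```

Here each question has four answers, every pair has three cases in common, and only two cases are complete, so one small data set yields three different candidates for "the" sample size.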

SPSS Missing Values Analysis

SPSS has introduced a new Missing Values Analysis option to add to the current version, and ACF has acquired a small license to use and distribute it. This new option performs three primary functions. First, it describes the patterns of missing data. Second, it describes the data using univariate and multivariate statistics. Third, it creates a data set with imputed values for the missing data with a method chosen by the user.

The analysis starts by asking for a list of quantitative and categorical variables to be considered. It then produces a table showing univariate statistics (number of cases, mean, standard deviation, and the frequency and percentage of missing data), and it uses the Tukey robust boxplot criterion to count extreme low and high values. A two-way table details percent mismatch for pairs of indicator variables. A mismatch occurs between a pair of variables when, for a particular case, one of the two variables has a missing value and the other does not.
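The following Python sketch approximates these two tables for readers who want to see the arithmetic. It is not the SPSS algorithm. The fences use the common boxplot rule of 1.5 times the interquartile range beyond the quartiles, quartile conventions differ between packages, and a mismatch is assumed here to mean that one variable is missing while the other is present. The variable names and values are invented.

```python
from itertools import combinations
from statistics import mean, quantiles, stdev

def univariate_summary(values):
    """Summarize one variable; None marks a missing value."""
    valid = [v for v in values if v is not None]
    q1, _, q3 = quantiles(valid, n=4)
    iqr = q3 - q1
    # Boxplot fences: 1.5 * IQR below Q1 and above Q3 (assumed rule).
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_missing = len(values) - len(valid)
    return {
        "n": len(valid),
        "mean": mean(valid),
        "std_dev": stdev(valid),
        "n_missing": n_missing,
        "pct_missing": 100.0 * n_missing / len(values),
        "n_extreme_low": sum(v < low_fence for v in valid),
        "n_extreme_high": sum(v > high_fence for v in valid),
    }

def percent_mismatch(data):
    """Percent of cases where exactly one variable of a pair is missing."""
    n_cases = len(next(iter(data.values())))
    result = {}
    for a, b in combinations(data, 2):
        mismatched = sum(
            (x is None) != (y is None) for x, y in zip(data[a], data[b])
        )
        result[(a, b)] = 100.0 * mismatched / n_cases
    return result

# Invented data for ten cases.
data = {
    "age": [21, 23, 24, 25, 26, 27, 29, 95, None, None],
    "income": [30, None, 45, 50, None, 38, 41, 60, None, 52],
}
age_summary = univariate_summary(data["age"])
mismatch = percent_mismatch(data)
print(age_summary)
print(mismatch)
```

In the invented data, the age of 95 falls above the upper fence and is counted as an extreme high value. The ninth case is missing both age and income, so it does not count as a mismatch under the assumption stated above.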

A multivariate table called "Tabulated Patterns" gathers groups of cases that have missing values for the same sets of variables. In other words, it shows us clusters of missing values within a set of variables. We can investigate the groups of cases and determine whether the cases are randomly distributed in our sample. If they are, and our sample is large enough, we might choose to drop those cases and analyze only the complete ones. On the other hand, we might find that a large group of people had the same six variables missing and then, by creating a flag variable for those cases and then listing cases, discover that they were all in our "Poor" socioeconomic group. Since this cluster of missing data is not randomly distributed, we must use a method other than simply dropping the cases.
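The idea behind such a pattern table, and the flag-variable check, can be sketched in Python as follows. The cases, the variable names and the socioeconomic labels are invented, and the grouping below is only an approximation of what the SPSS table reports.

```python
from collections import Counter

def missing_pattern(case, variables):
    """Return the tuple of variables that are missing for one case."""
    return tuple(v for v in variables if case[v] is None)

# Invented cases; "ses" is the socioeconomic group, None marks missing data.
cases = [
    {"id": 1, "ses": "Poor", "income": None, "rent": None, "age": 40},
    {"id": 2, "ses": "Poor", "income": None, "rent": None, "age": 35},
    {"id": 3, "ses": "Poor", "income": None, "rent": None, "age": 52},
    {"id": 4, "ses": "Middle", "income": 40, "rent": 9, "age": None},
    {"id": 5, "ses": "Middle", "income": 55, "rent": 12, "age": 31},
    {"id": 6, "ses": "High", "income": 90, "rent": 20, "age": 45},
]
variables = ["income", "rent", "age"]

# Tabulate how many cases share each pattern of missing variables.
patterns = Counter(missing_pattern(c, variables) for c in cases)

# Flag the cases in the largest incomplete pattern, then list their groups.
target = ("income", "rent")
flagged = [c for c in cases if missing_pattern(c, variables) == target]
ses_of_flagged = Counter(c["ses"] for c in flagged)

print(patterns)
print(ses_of_flagged)
```

In this example every flagged case belongs to the "Poor" group, which is the kind of non-random cluster that rules out simply dropping the cases.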

Split the responses into two groups where the first group has a missing response on an important variable such as income, and the second group has answered that question. Do these groups differ significantly in their mean on other variables? In other words, are the people who form the group missing income data randomly drawn from the respondents? If not, we must treat those cases differently than we would if they were randomly taken from the respondents. We can't simply drop those cases. The missing values option readily generates t-tests to check for random distribution.
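As a rough sketch of this comparison, the Python below splits invented cases by whether income is missing and computes a separate-variance (Welch) t statistic on age. The choice of the Welch form is an assumption made for the example, the data are made up, and no p-value is computed because that would need a t distribution table or a statistics library.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Return the separate-variance t statistic and its approximate df."""
    na, nb = len(group_a), len(group_b)
    va = variance(group_a) / na  # squared standard error of mean a
    vb = variance(group_b) / nb  # squared standard error of mean b
    t = (mean(group_a) - mean(group_b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Invented cases: (age, income), with None for a missing income.
cases = [
    (61, None), (34, 28), (58, None), (41, 35), (29, 22),
    (66, None), (38, 31), (45, 40), (70, None), (36, 30),
]
age_income_missing = [age for age, income in cases if income is None]
age_income_present = [age for age, income in cases if income is not None]

t, df = welch_t(age_income_missing, age_income_present)
print(f"t = {t:.2f}, df = {df:.1f}")
```

A large t statistic, as these invented numbers produce, suggests that the people who skipped the income question are older on average than those who answered it, so their missing values are not a random draw from the respondents.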

There are many considerations in choosing the variables for which missing data will be imputed, and in deciding which method to use. Before this new option was available, SPSS could still gather the necessary information, but the methods were much more tedious and less organized. Finally, this SPSS option creates a new data set with valid values replacing missing data. Values are imputed to the new data set using multiple regression and other methods.

The regression method lets the researcher choose the variables that best explain the variation in the variable whose missing data will be replaced. Multiple imputation allows randomness to be incorporated in regressed values. That is, randomly selected values from a chosen distribution around the regressed value will be imputed to the variable.
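A minimal Python sketch of regression imputation with a random component follows. It makes several simplifying assumptions that are not taken from SPSS: a single predictor rather than multiple regression, a normal distribution for the random component, and invented data. The names fit_line and impute are hypothetical.

```python
import random
from math import sqrt
from statistics import mean

def fit_line(x, y):
    """Least-squares line y = intercept + slope * x, plus residual SD."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    resid_sd = sqrt(sum(r * r for r in residuals) / (len(x) - 2))
    return slope, intercept, resid_sd

def impute(x, y, rng, add_noise=True):
    """Replace None in y with a regressed value, optionally plus noise."""
    complete = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    slope, intercept, resid_sd = fit_line(*zip(*complete))
    filled = []
    for xi, yi in zip(x, y):
        if yi is None:
            yi = intercept + slope * xi
            if add_noise:
                # Draw from a normal distribution around the regressed value.
                yi += rng.gauss(0.0, resid_sd)
        filled.append(yi)
    return filled

# Invented data: years of education predicts income (in thousands).
education = [8, 10, 12, 12, 14, 16, 16, 18]
income = [18, 22, None, 30, 33, None, 41, 46]

regressed = impute(education, income, random.Random(0), add_noise=False)
with_noise = impute(education, income, random.Random(1))
print(regressed)
print(with_noise)
```

Without the random component, every imputed case sits exactly on the regression line, which understates the variability of the completed variable. Repeating the noisy version with different seeds would give several alternative completed data sets.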

Most studies include a discussion of the factors responsible for missing values. In reputable studies, analysis of missing data is called for. It is very helpful to have all these analysis tasks in one procedure. A quick pass through this new SPSS option should improve the validity of even the most casual study.


Frank LoPresti heads ACF's Statistics and Social Sciences Group.
frank.lopresti@nyu.edu

Posted October 5, 1998