[Ed: Links to web pages and/or email addresses which have become inactive since the publication of this article have been enclosed in curly brackets { }. Replacement links have been provided where possible.]
SUDAAN is a very powerful and flexible statistical package designed for analysis of complex samples. The fact that it can read either SAS or SPSS data sets makes it widely applicable to already existing national or large-scale survey research data sets distributed by ICPSR and other institutions. Its user-friendly syntax makes it an ideal teaching tool for professors who teach survey sampling or do serious research.
Researchers often find, when analyzing large-scale data sets from national probability samples, that the sets were collected using complex sampling designs involving stratification, clustering and/or replication. Such collection techniques can affect the statistical analysis. For instance, stratification involves choosing a set of grouping variables relevant to the study's focus, dividing the whole population into cross-classified groups and randomly sampling within those groups. This stratification yields sampling variances within the strata that are substantially smaller than those of a simple random sample.
Many researchers are not aware that standard statistical packages presume a simple random sample. In other words, the packages by themselves do not adjust for aspects of complex sample design. In most complex designs, statistical adjustments have to be made by weighting for unequal probability of selection, nonresponse, stratification, replication and clustering at each stage of the sampling.
Accurate estimates of population parameters depend on the weights used in the analysis, and these in turn are based on the selection probabilities along with other characteristics of the stratification or clustering. Generally, weighting adjustments are made separately for enumeration when the original census is taken, and for interview response during the survey. If researchers ignore the issue of weighting, point estimates of population parameters will be biased.
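The effect of ignoring weights can be illustrated with a minimal sketch. It assumes a hypothetical sample in which one group was selected at a higher rate than the other, and it uses the simplest possible base weight, the inverse of the selection probability. The data values and function names are invented for illustration; this is not SUDAAN syntax, and it omits the nonresponse and other adjustments described above.

```python
# Hypothetical sample: (observed value, selection probability) pairs.
# Group B was oversampled (1 in 20) relative to group A (1 in 100).
sample = [
    (10.0, 0.01), (12.0, 0.01),                                # group A
    (30.0, 0.05), (32.0, 0.05), (34.0, 0.05), (28.0, 0.05),    # group B
]


def unweighted_mean(rows):
    """Mean that treats the data as a simple random sample."""
    return sum(y for y, _ in rows) / len(rows)


def weighted_mean(rows):
    """Mean using base weights equal to 1 / selection probability."""
    numerator = sum(y / p for y, p in rows)
    denominator = sum(1.0 / p for _, p in rows)
    return numerator / denominator


print("unweighted:", round(unweighted_mean(sample), 2))
print("weighted:  ", round(weighted_mean(sample), 2))
```

In this made-up sample the unweighted mean is pulled toward the oversampled group (about 24.3), while the weighted mean (about 16.7) gives each group its population share.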
Moreover, without adjusting for strata or clustering effects, the estimates of the variance, standard error and confidence interval would be wrong. Stratification into relatively homogeneous subpopulations is usually performed so that the strata variances are more compressed than are those of a simple random sample. The effect is to yield more accurate estimates of population parameters. If the statistician were to presume a simple, unweighted random sample, strata variances would be inflated. To properly assess the strata variances, adjustments must be made.
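A small sketch can make the inflation concrete. It assumes the textbook variance formulas for a sample mean under simple random sampling and under stratified random sampling without replacement, applied to two hypothetical, internally homogeneous strata of 500 units each. The data and function names are illustrative only and do not represent SUDAAN's internal computations.

```python
from statistics import variance


def srs_variance_of_mean(values, population_size):
    """Variance of the mean if the data are treated as one simple random sample."""
    n = len(values)
    fpc = 1 - n / population_size          # finite population correction
    return fpc * variance(values) / n


def stratified_variance_of_mean(strata):
    """Variance of the stratified mean.

    strata: list of (stratum population size, list of sampled values).
    """
    total_population = sum(size for size, _ in strata)
    result = 0.0
    for size, values in strata:
        n_h = len(values)
        share = size / total_population    # stratum share of the population
        fpc = 1 - n_h / size
        result += share ** 2 * fpc * variance(values) / n_h
    return result


# Hypothetical strata: values are similar within a stratum, different between.
strata = [(500, [10, 11, 9, 10]), (500, [30, 29, 31, 30])]
pooled = [y for _, values in strata for y in values]

print("treated as SRS:", srs_variance_of_mean(pooled, 1000))
print("stratified:    ", stratified_variance_of_mean(strata))
```

Because nearly all of the variation in this example lies between the strata, the variance computed under the simple random sample assumption is far larger than the stratified variance.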
Some government granting agencies are inclined to insist on sampling design adjustments to ensure the accuracy of their significance tests and other estimates. SUDAAN is a program widely used by U.S. government agencies, including the U.S. Centers for Disease Control and the U.S. Food and Drug Administration, to adjust the variances, standard errors and confidence intervals in accordance with the sample design.
SUDAAN version 7.5.2 allows the user to specify variables for stratification, levels of stratification, clustering, nesting, subpopulations, levels of subpopulations, the case identification variable, total counts, sample counts at each sample stage, and joint probability variables for unequal selection without replacement. In multistage sampling, the nesting variables are listed. Nesting variables specify the primary sampling unit level and the number of stratification levels, as well as whether the records are sorted or not. The ID variables are indicated, and in order to calculate the sampling fractions at each stage, the total count and sample count variables are indicated. From these the sample weights are computed. These variables may be specified in the regular data file or they may be specified in a special PSUDATA file, which SUDAAN reads separately.
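How total counts and sample counts translate into weights can be sketched as follows. The sketch assumes equal-probability selection without replacement at every stage, so that the sampling fraction at a stage is the sample count divided by the total count and the record weight is the product of the inverse fractions. The function names are hypothetical, and SUDAAN's own handling of these variables may differ in detail.

```python
def stage_weight(total_count, sample_count):
    """Inverse sampling fraction for one stage of selection."""
    if sample_count <= 0 or sample_count > total_count:
        raise ValueError("sample count must be between 1 and the total count")
    return total_count / sample_count


def record_weight(stages):
    """Overall weight for a record.

    stages: list of (total count, sample count), one pair per sampling stage.
    """
    weight = 1.0
    for total_count, sample_count in stages:
        weight *= stage_weight(total_count, sample_count)
    return weight


# Hypothetical two-stage design: 5 of 50 PSUs, then 20 of 200 households.
print(record_weight([(50, 5), (200, 20)]))  # 100.0
```

Each sampled household in this example stands for 100 households in the population.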
Once the design is specified and the sample variables are constructed in the data set, adjustments are made and the variances are corrected specifically for the statistical procedure invoked. SUDAAN allows an examination of the data dictionary with a Records procedure. With a Descriptive procedure, it can estimate the sample size, population size, means, proportions, geometric means, quantiles, standard errors, and design effect for each level of a variable under examination.
With the Ratio procedure, estimates and standard errors for generalized ratios can be computed. With the Crosstabs procedure, SUDAAN can compute frequencies, percentage distributions, odds ratios, relative risks and standard errors, as well as chi-square tests for independence and Cochran-Mantel-Haenszel Chi-square tests for stratified two-way tables.
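As a reminder of what the odds ratio and relative risk measure, the following sketch computes both from a 2x2 table of weighted cell totals. The table values and function names are hypothetical. The sketch covers only the point estimates; the design-adjusted standard errors that SUDAAN reports require the variance methods discussed elsewhere in this article and are not attempted here.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table laid out as:

                 outcome   no outcome
    exposed         a          b
    unexposed       c          d
    """
    return (a * d) / (b * c)


def relative_risk(a, b, c, d):
    """Risk of the outcome among the exposed divided by risk among the unexposed."""
    return (a / (a + b)) / (c / (c + d))


# Hypothetical weighted cell totals (sums of sample weights in each cell).
a, b, c, d = 40.0, 60.0, 20.0, 80.0
print("odds ratio:   ", round(odds_ratio(a, b, c, d), 3))
print("relative risk:", round(relative_risk(a, b, c, d), 3))
```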
With the Regression procedure, the program fits linear regression models and tests hypotheses for model parameters. It uses generalized estimating equations to efficiently estimate the regression parameters with robust variance estimation. The values of the dependent variable for different levels of an independent variable or interactions between two or more independent variables may be tested with effects or contrasts statements. Even least squares means for different levels of categorical covariates may be computed.
The program can perform an assortment of logistic regression analyses for binary, ordinal and multinomial dependent variables. It can estimate odds ratios and confidence intervals for the model parameters. It can employ generalized estimating equations for robust variance estimation to calculate the standard errors for cross-sectional or longitudinal models with dichotomous, count, ordinal or continuous dependent variables. In sum, SUDAAN provides a wide variety of powerful statistical analyses that can be adjusted for complex sample design.
File management in SUDAAN is easiest when most of the work is done in either SAS or SPSS. SUDAAN version 7.5.2 can read SAS 6.12 and SPSS/Windows version 8 data sets directly. If SUDAAN reads ASCII files instead, two ASCII files are needed and two additional optional files are recommended. The different SUDAAN ASCII file types are distinguished by their file suffixes. SUDAAN needs a data file, called a .DBS file. To indicate which variables are to be found where in the data set, it needs a codebook file, called a .LAB file. The two optional documentation files are a .FLD file, which specifies a title for the data set, and a .LEV file, which specifies the labels for the levels (answer categories) of discrete variables. To make the syntax easy to understand for the sophisticated user, the SUDAAN syntax is almost identical to that of SAS. The SUDAAN file has input syntax that defines the design and accesses the data, and output syntax that formats the output.
SUDAAN users are advised to do their preprocessing and data management with another program beforehand. SUDAAN is not designed to be a data management package, and it lacks the features that would endow it with good data management capability. Before analysis, all data must be converted to numeric type and sorted according to the ascending levels of the nesting variable. Categorical data must be recoded so that none of the variables has a 0 code, for in SUDAAN that code stands for a missing value. Recoding is better done with other packages, although SUDAAN can perform this task. Missing value estimation should be taken care of before SUDAAN analysis. No missing values are tolerated among the sample design variables: observations without complete sample design data are dropped from the analysis.
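The preparation steps described above can be sketched in a general-purpose language before the data are handed to SUDAAN. The example below assumes hypothetical variable names (stratum, psu, weight, sex) and one particular recoding choice, shifting 0/1 codes to 1/2. It drops incomplete records explicitly so that the analyst can see how many cases are lost, rather than leaving that to the analysis step.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"stratum": 2, "psu": 1, "sex": 0, "weight": 120.0},
    {"stratum": 1, "psu": 2, "sex": 1, "weight": 95.0},
    {"stratum": 1, "psu": 1, "sex": 1, "weight": None},   # incomplete design data
    {"stratum": 1, "psu": 1, "sex": 0, "weight": 80.0},
]

DESIGN_VARS = ("stratum", "psu", "weight")   # assumed design variables
NEST_VARS = ("stratum", "psu")               # assumed nesting variables


def prepare(rows, categorical=("sex",)):
    """Drop incomplete records, remove 0 codes, and sort by nesting variables."""
    cleaned = []
    for row in rows:
        # Skip records with any missing sample design variable.
        if any(row[name] is None for name in DESIGN_VARS):
            continue
        new_row = dict(row)
        # Shift categorical codes up by one so that no category is coded 0.
        for name in categorical:
            if new_row[name] is not None:
                new_row[name] += 1
        cleaned.append(new_row)
    # Sort in ascending order of the nesting variables.
    cleaned.sort(key=lambda r: tuple(r[name] for name in NEST_VARS))
    return cleaned


prepared = prepare(records)
print(len(records) - len(prepared), "record(s) dropped")
for row in prepared:
    print(row)
```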
In SUDAAN, the sample design has to be specified separately under each statistical analysis invoked. SUDAAN can handle the basic kinds of analysis, such as fixed effects in a general linear model. It cannot perform mixed model ANOVAs, for the procedures within SUDAAN do not handle random effects.
Like WesVar Complex Samples, its less expensive competitor, SUDAAN performs BRR and jackknife estimation of robust variances, but it can also produce robust variance estimates with Taylor series linear approximation. If public use files are designed for a particular type of variance estimation, SUDAAN, unlike its competitors, has the capability of handling all three of them.
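To give a sense of what jackknife variance estimation involves, here is a minimal sketch of a stratified delete-one-PSU jackknife for a weighted mean. It assumes each record is a (stratum, PSU, weight, value) tuple, a layout invented for this illustration, and it rescales the weights of the remaining PSUs in the affected stratum for each replicate. It is a textbook version of the technique, not a reproduction of SUDAAN's or WesVar's implementation.

```python
from collections import defaultdict


def weighted_mean(rows):
    """Weighted mean of the values in (stratum, psu, weight, value) tuples."""
    total_weight = sum(w for _, _, w, _ in rows)
    return sum(w * y for _, _, w, y in rows) / total_weight


def jackknife_variance(rows):
    """Return (estimate, variance) using a delete-one-PSU jackknife."""
    psus_by_stratum = defaultdict(set)
    for stratum, psu, _, _ in rows:
        psus_by_stratum[stratum].add(psu)

    full_estimate = weighted_mean(rows)
    var = 0.0
    for stratum, psus in psus_by_stratum.items():
        n_h = len(psus)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        for dropped in psus:
            replicate = []
            for s, p, w, y in rows:
                if s == stratum and p == dropped:
                    continue                      # delete this PSU
                if s == stratum:
                    w = w * n_h / (n_h - 1)       # reweight remaining PSUs
                replicate.append((s, p, w, y))
            diff = weighted_mean(replicate) - full_estimate
            var += (n_h - 1) / n_h * diff ** 2
    return full_estimate, var


# Hypothetical data: two strata with two PSUs each.
data = [
    ("A", 1, 10.0, 4.0), ("A", 1, 10.0, 6.0), ("A", 2, 10.0, 9.0),
    ("B", 1, 20.0, 1.0), ("B", 2, 20.0, 3.0),
]
estimate, var = jackknife_variance(data)
print("estimate:", estimate, "standard error:", var ** 0.5)
```

With a single stratum of two equally weighted PSUs holding the values 1 and 3, the sketch returns a mean of 2 and a variance of 1, which matches the usual formula for the variance of a mean of two observations.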
SUDAAN, unlike WesVar, can handle multiple types of survival analysis, including Cox regression. While neither SPSS nor WesVar does generalized estimating equations for longitudinal analysis, SUDAAN does. In short, SUDAAN is currently the most powerful of the packages that perform complex sample data analysis.
To run SUDAAN, the computer needs to be a 386 or more powerful PC compatible with a math coprocessor and at least 4 MB of random access memory. For best results, 8 MB of RAM and 5 MB of hard disk space are recommended.
SUDAAN 7.5.2 runs as a standalone program or as a procedure within SAS. Standalone versions can run on PC DOS, Windows 3.1, Windows 95, Windows NT, SUN Solaris, VAX/VMS, and DEC Alphas with Open VMS. The SAS-callable version, which is installed as a procedure within SAS, can run under Windows 95, Windows NT, SUN Solaris, DEC VAX, or IBM MVS.
SUDAAN provides several very informative and well-organized training workshops each year at different locations in the United States. Interested persons should contact SUDAAN in Research Triangle Park, N.C., either by phone at (919) 541-6602 or via the World Wide Web at http://www.rti.org/sudaan/. For questions about ACF availability of the software, interested persons may contact the author by phone at (212) 998-3402 or by e-mail at {robert.yaffee@nyu.edu}.
Shah, B., Barnwell, B., Bieler, G.S. (1997). SUDAAN User's Manuals, Vols. I & II, Research Triangle Park, NC: Research Triangle Institute.
Williams, R. (1998). SUDAAN software, Research Triangle Institute (personal communication), June 12, 1998.
Posted: October 5, 1998. Last reviewed: March 13, 2007.