By Frank LoPresti
November 24, 2009
This article is a brief introduction to survey sampling theory, and a look at two popular statistical packages (SUDAAN and Stata's "svy") available at the Data Service Studio that allow researchers to use datasets created by sampling. Survey sampling is a large and complex subject, so to give it some context, this article will begin by highlighting three research projects, their associated datasets, and the research issues that led to their creation. The first was accomplished without sampling; the second with simple random sampling. The third involved complex sampling, which leads to the types of weights and datasets that require complex survey analysis tools. I'll conclude with a quick view of SUDAAN and Stata's svy, both of which deal with complex survey data.
When I first look at a dataset, I always search the documentation for the words "sample" and "weight." The documentation will give details on how the data was collected. If there are no sections dealing with sampling, and no discussion of weights, then the analysis can proceed without weights and without specialized software. But if the documentation does discuss weights and you ignore them, the analysis will be flawed.
Years ago, I was asked to look at students' entrance test results in relation to their success in their first year of undergraduate study at a university. The data, comprising the students' first-year grades and their SAT results, was available from the university student information system. It came to me in Microsoft Excel format, with every student's data ready to go. Because the dataset was small, consisting of roughly 10,000 data points, it wasn't prohibitively expensive to create. Often the cost or impossibility of obtaining a complete dataset is the reason we end up sampling, as the next example shows. In this case, sampling was unnecessary.
Recently, I was asked to help with a simple dataset of nine million admissions to all hospitals within a state in a particular year (containing just an ID number, a street address, and a zip code). The researchers wanted to merge this data with two files. One would have census tract level data on income, race, age, etc., and would be merged with all nine million records (if the census tract for each admitted person was known). The second file would have data on the hospital diagnoses, cost, etc., for all the admissions.
The challenge was connecting the census tract number with the patient's address. This involved geocoding, a function in GIS. An article in the American Journal of Public Health discusses the cost of geocoding.1 The article suggests that geocoding 1 million records with 50 percent accuracy would cost approximately $40,000. Because the project was grant-funded, this approach was unaffordable. We chose instead to do a Simple Random Sample (SRS) using SPSS and geocode 1 in 100 records, an acceptable alternative. No specialized statistical package was required. An SRS is much like putting a piece of paper for each of the 9 million records in a hat and pulling out 90,000, or 1 out of each 100. Since these were randomly chosen, each sampled record carries a weight of 100 in the study population, which meant our dataset would have 90,000 records instead of 9 million.
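The 1-in-100 draw described above can be sketched in a few lines of Python. This is a minimal illustration, not the SPSS procedure the project actually used, and the record IDs are hypothetical:

```python
import random

def simple_random_sample(records, rate):
    """Draw a simple random sample of len(records) * rate items.

    Every record has the same chance of selection, so each sampled
    record represents 1 / rate records in the full population.
    """
    k = round(len(records) * rate)
    sample = random.sample(records, k)   # selection without replacement
    weight = 1 / rate                    # e.g. 100 when rate is 1/100
    return [(rec, weight) for rec in sample]

# Hypothetical admission IDs standing in for the full admissions file.
population = list(range(1_000_000))
sampled = simple_random_sample(population, 1 / 100)
print(len(sampled))   # 10,000 records, each carrying a weight of 100
```

Because every record has the same selection probability, the weight is a single constant, which is why no specialized survey software was needed.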
The first big research project in which I participated was an Environmental Protection Agency study of drinking water drawn straight from the tap. The $2 million grant budgeted $1 million for water collection and lab analysis. The other half would be spent on writing the report. Our budget allowed us to collect and do a lab analysis for 10,000 glasses of tap water across the US.
A biostatistician, one of the principal investigators, first had to develop a method to decide which houses to sample. She chose to use a complex sample, which involves concepts such as strata, clusters, and oversampling. Strata, divisions of the United States needed for later reporting, were created starting with "regions" and "states." She created sampling strata for seven regions using SUDAAN. Each state within a region was another level of strata. She sampled all regions, as well as all states. Within each state she created rural-versus-urban strata clusters. Other divisions were included in the design of this complex sample.
When the data in a study are expected to differ among strata, those strata are sampled as clusters. The bacteria count in urban tap water wasn't expected to differ much from one state to another. On the other hand, because rural well water was assumed to be less regulated than city water, the "urban" and "rural" strata clusters were created.
The sample was constructed with the mathematical rigor that would allow us to report, for example, "In the US, 40 percent of the population drinks X glasses of tap water," or, "The Midwest region has X levels of radon." Each line of data from one house represented about 1,000 houses, sometimes more, sometimes less, depending on the cluster, and this number was the weight assigned to the cluster within the entire sample. For example, in NYC, where the infrastructure for delivering city water is good, it wasn't an efficient use of funds to test one of every 1,000 taps. But in a rural area without a city water system it made sense to sample more, to oversample, since neighboring houses might have very different well water coming from their taps. Thus the number of houses represented by each line of data, or weight, varied from cluster to cluster.
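The weight attached to each line of data is simply the number of houses in the cluster divided by the number of houses actually sampled there. A small Python sketch makes the arithmetic explicit; the figures are illustrative, not taken from the EPA study:

```python
def cluster_weight(houses_in_cluster, houses_sampled):
    """Weight per sampled house: how many houses each line of data represents."""
    return houses_in_cluster / houses_sampled

# An urban cluster sampled at roughly 1-in-1000 versus an oversampled
# rural cluster, where well water varies from house to house.
urban = cluster_weight(500_000, 500)   # each line represents 1,000 houses
rural = cluster_weight(20_000, 200)    # oversampled: each line represents 100 houses
print(urban, rural)
```

Oversampling the rural cluster shrinks its per-record weight, which is exactly why the weights vary from cluster to cluster.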
The first dataset in this discussion was not a sample. Rather, it was a full year's grades for all new students at a university. Any of the most popular statistical packages at NYU (SPSS, SAS, R, Stata, Minitab, etc.) could be used without any reference to weights while running the analysis. In the example of the simple random sample of hospital admissions, the data could be analyzed with any of the same packages, all of which enable the statistician to pick a single weight variable. In that case, since every admission represented 100 admissions, the weight was always 100.
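With a single constant weight like the 100 in the hospital admissions sample, weighted analysis is straightforward. The sketch below, with hypothetical per-admission costs, shows why: a constant weight leaves the mean unchanged and only scales totals up to the population:

```python
def weighted_mean(values, weights):
    """Estimate a population mean from sampled values and their weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical per-admission costs from the 1-in-100 sample.
costs = [1200.0, 800.0, 2500.0, 950.0]
weights = [100] * len(costs)

# With a constant weight the weighted mean equals the plain mean,
# while weighted totals scale up to the full population.
print(weighted_mean(costs, weights))                    # same as sum(costs)/len(costs)
print(sum(c * w for c, w in zip(costs, weights)))       # estimated population total
```

This is the kind of single-weight-variable computation that any general-purpose package (SPSS, SAS, R, Stata) handles directly.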
The tap water example, however, involved a multi-staged sample with subsets of unequal size and oversampling. The many weight variables—for regions, states, urban/rural clusters, and households—required complex calculations that are best performed with packages like SUDAAN and Stata svy.
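To see why unequal weights change the arithmetic, here is a bare-bones sketch of a stratified estimate in Python, with invented bacteria counts for an urban stratum (weight 1,000 per record) and an oversampled rural stratum (weight 100 per record). It illustrates only the point estimate; this is not what SUDAAN or svy actually do internally:

```python
def stratified_mean(strata):
    """Combine per-stratum samples into one population estimate.

    Each stratum supplies (values, weight_per_record); strata sampled
    at different rates contribute different weights.
    """
    total = sum(v * w for values, w in strata for v in values)
    count = sum(w * len(values) for values, w in strata)
    return total / count

# Illustrative bacteria counts: a sparsely sampled urban stratum and
# an oversampled rural stratum (figures are invented).
urban = ([2.0, 3.0], 1000)
rural = ([9.0, 7.0, 8.0], 100)
print(stratified_mean([urban, rural]))
```

Note that an unweighted average of the five values would badly overstate the rural contribution. Packages like SUDAAN and Stata's svy go further than this sketch: they also use the strata and cluster structure to compute correct standard errors, which a naive weighted mean cannot provide.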
Often, the creators of complex datasets will provide the code that allows researchers to quickly get a package like SUDAAN up and running. (See http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/03088 for an advanced example where the "…Finite Population Correction Factor and the two Stratified Jackknife Factor data files are provided for use with the WesVar and SUDAAN statistical software.")
Survey sampling occurs in many different forms and at many levels of complexity. This complexity determines how the data should be processed using weights. In some cases, such as the entrance test dataset, the data was easily available in digital form from the university. There was no cost to processing more rather than less data, and no sampling was required. For the hospital admissions study, the records had to be processed at a fixed cost per record, so I chose a simple random sample in order to affordably use files that contained 9 million records. Because there was only a single weight variable, a more extensive statistical package wasn't necessary. In the EPA water quality study, the scale of the project and the need to oversample certain areas made a more complex sample necessary. These three examples show the range of sampling methods. More complex weights force us to use more sophisticated statistical packages. To do otherwise could result in a flawed analysis.
Frank LoPresti is a Senior Faculty Technology Specialist for ITS Data Services.