6 Survey Data
One major strength of Stata is the ease with which it can analyze data sets arising from complex sample surveys. When data come from a sample with a complex design (anything beyond a simple random sample of a population, such as a design involving clustering, stratification, or multiple stages of sampling), standard procedures that assume a simple random sample (such as everything we’ve discussed so far) ignore the design and can produce badly biased estimates and incorrect standard errors. Two major problems arise when survey data are analyzed without taking the design into account:
- Representation
- Variance Estimation
Incorporating the sampling weights addresses representation by correcting biased point estimates, while accounting for the stratification and clustering produces correct variance estimates.
Stata is one of the leading statistical packages for these types of analyses, and offers a wide variety of commands for design-based analysis of data arising from a sample with a complex design. The basic process consists of two steps (similar to mi): first use svyset to describe the complex survey design, then use the svy: prefix to perform analyses.
6.1 Definitions
Complex survey design is a massive topic, one to which entire departments are devoted (such as the Program in Survey Methodology here at Michigan) and on which we offer a separate full-day workshop (Survey Design). A simple survey design takes a random sample from the population as a whole. There are various reasons why a simple random sample may not work.
- It is often infeasible because of either time or cost.
- With smaller sample sizes, it can be difficult to obtain enough individuals in a given subpopulation.
- For some small subpopulations, it may be very difficult to even obtain any individuals in a simple random sample.
A complex survey design allows researchers to consider these limitations and design a sampling pattern to overcome them. Three primary techniques are
- Stratification. Rather than sample all individuals, instead target specific subpopulations and collect from them explicitly. For example, you may stratify by race and aim to collect 50 white, 50 black, 50 Hispanic, etc.
- Clustering. Primarily a cost/time saving measure. Similar to stratification, but instead of sampling from all clusters, you take a random sample of clusters and then sample within them. A typical clustering variable is neighborhood or census tract or school.
- Weighting. If certain sets of characteristics are more or less common, or more or less desired, when randomly sampling individuals, we can down-weight those who we don’t want/are more common, and up-weight those we want/are less common.
For example, we might want to collect data on obesity in school children in Ann Arbor. Rather than randomly sampling across all schools, we cluster by school and randomly select 3 schools. Then at each of those schools, we stratify by race and take a random sample of students of each race, weighting the selection by body weight in an attempt to capture more overweight students.
One final term is primary sampling unit (PSU), which is the first level at which we randomized. In this example, that would be schools.
6.2 Describing the survey
The general syntax is
svyset <psu> [pweight = <weight>], strata(<strata>)
The svyset command defines the variables identifying the complex design of the sample to Stata, and only needs to be submitted once in a given Stata session. The <psu> is a variable identifying the primary sampling unit (PSU) that an observation came from. The <weight> is a variable containing sampling weights. Finally, the <strata> is a variable identifying the sampling stratum that an observation came from.
The NHANES data we’ve been using in our examples is actually from a complex sample design, which we’ve been ignoring. Let’s incorporate the sampling into the analysis.
. webuse nhanes2, clear
The three variables of interest in the data are finalwgt for the sampling weights, strata for the strata, and psu for the clusters.
. describe finalwgt strata psu

Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
finalwgt        long    %9.0g                 Sampling weight (except lead)
strata          byte    %9.0g                 Stratum identifier
psu             byte    %9.0g      psulbl     Primary sampling unit
It’s useful to know that to remove any existing survey design, you can run
. svyset, clear
Let’s set up the survey design now.
. svyset psu [pweight = finalwgt], strata(strata)

Sampling weights: finalwgt
             VCE: linearized
     Single unit: missing
        Strata 1: strata
 Sampling unit 1: psu
           FPC 1: <zero>
To get information about the strata and cluster variables use the following command or menu:
. svydescribe

Survey: Describing stage 1 sampling units

Sampling weights: finalwgt
             VCE: linearized
     Single unit: missing
        Strata 1: strata
 Sampling unit 1: psu
           FPC 1: <zero>

                                    Number of obs per unit
 Stratum   # units     # obs       Min      Mean       Max
----------------------------------------------------------
       1         2       380       165     190.0       215
       2         2       185        67      92.5       118
       3         2       348       149     174.0       199
       4         2       460       229     230.0       231
       5         2       252       105     126.0       147
       6         2       298       131     149.0       167
       7         2       476       206     238.0       270
       8         2       338       158     169.0       180
       9         2       244       100     122.0       144
      10         2       262       119     131.0       143
      11         2       275       120     137.5       155
      12         2       314       144     157.0       170
      13         2       342       154     171.0       188
      14         2       405       200     202.5       205
      15         2       380       189     190.0       191
      16         2       336       159     168.0       177
      17         2       393       180     196.5       213
      18         2       359       144     179.5       215
      20         2       285       125     142.5       160
      21         2       214       102     107.0       112
      22         2       301       128     150.5       173
      23         2       341       159     170.5       182
      24         2       438       205     219.0       233
      25         2       256       116     128.0       140
      26         2       261       129     130.5       132
      27         2       283       139     141.5       144
      28         2       299       136     149.5       163
      29         2       503       215     251.5       288
      30         2       365       166     182.5       199
      31         2       308       143     154.0       165
      32         2       450       211     225.0       239
----------------------------------------------------------
      31        62    10,351        67     167.0       288
Once the survey is defined with svyset, most common commands can be prefaced by svy: to analyze the data with the sampling structure. The svy: tab command works exactly like the tabulate command, only taking the design of the sample into account when producing estimates and chi-square statistics.
. svy: tab sex
(running tabulate on estimation sample)

Number of strata = 31                            Number of obs   =      10,351
Number of PSUs   = 62                            Population size = 117,157,513
                                                 Design df       =          31

----------------------
      Sex | proportion
----------+-----------
     Male |      .4794
   Female |      .5206
          |
    Total |          1
----------------------
Key: proportion = Cell proportion
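The one-way table above reports no test statistic; a two-way table does. The following is a minimal sketch, assuming the nhanes2 design declared earlier. The choice of race and sex as the two variables is illustrative only and is not part of the original analysis.

```stata
* Reload the data and declare the design so the example is self-contained
webuse nhanes2, clear
svyset psu [pweight = finalwgt], strata(strata)

* Two-way table with row proportions; for two-way tables the output
* also includes a design-based test of independence
svy: tabulate race sex, row
```

The row option requests row proportions in place of the default cell proportions.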
Next, let’s look at the mean weight by sex.
. svy: mean weight, over(sex)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31            Number of obs   =      10,351
Number of PSUs   = 62            Population size = 117,157,513
                                 Design df       =          31

--------------------------------------------------------------
             |             Linearized
             |       Mean   std. err.     [95% conf. interval]
-------------+------------------------------------------------
c.weight@sex |
        Male |   78.62789   .2097761      78.20004    79.05573
      Female |   65.70701    .266384      65.16372    66.25031
--------------------------------------------------------------
Compare this to the usual mean command, without the design information:
. mean weight, over(sex)

Mean estimation                         Number of obs = 10,351

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
c.weight@sex |
        Male |   77.98423   .1945289      77.60292    78.36555
      Female |   66.39418   .1998523      66.00243    66.78593
--------------------------------------------------------------
And compare the svy: results to the usual mean command, with only the weights considered:
. mean weight [pweight=finalwgt], over(sex)

Mean estimation                         Number of obs = 10,351

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
c.weight@sex |
        Male |   78.62789   .2272099      78.18251    79.07326
      Female |   65.70701   .2265547      65.26292     66.1511
--------------------------------------------------------------
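We see that the weights affect the point estimates (the weighted means differ from the unweighted ones, and match the svy: means exactly), whereas the stratification and clustering leave the point estimates unchanged and affect only the standard errors.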
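One way to summarize how much the design changes the standard errors is the design effect. The following is a minimal sketch, assuming the nhanes2 design declared earlier; estat effects is a postestimation command available after svy estimation, and its use here is an addition to the original comparison.

```stata
* Reload the data and declare the design so the example is self-contained
webuse nhanes2, clear
svyset psu [pweight = finalwgt], strata(strata)

* Design-based means of weight by sex
svy: mean weight, over(sex)

* DEFF compares the design-based variance to the variance expected
* under a simple random sample; DEFT is the analogous ratio for
* standard errors
estat effects
```

Values of DEFF above 1 indicate that the complex design yields less precise estimates than a simple random sample of the same size would.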
Many of the usual commands such as regress or logit can be prefaced by svy:. If a command errors with the svy: prefix, it is often because the survey design does not affect that command, and the documentation for the command will say so.
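As a minimal sketch of prefixing estimation commands, the following fits a linear and a logistic regression using the design declared earlier. The model specifications are assumptions chosen for illustration; highbp is a 0/1 high blood pressure indicator included in the nhanes2 data.

```stata
* Reload the data and declare the design so the example is self-contained
webuse nhanes2, clear
svyset psu [pweight = finalwgt], strata(strata)

* Design-based linear regression of weight on height and sex
svy: regress weight height i.sex

* Design-based logistic regression of high blood pressure on age and sex
svy: logit highbp age i.sex
```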
6.3 Subset analyses for complex sample survey data
In general, analysis of a particular subset of observations from a sample with a complex design should be handled very carefully. It is usually not appropriate to delete cases that fall outside the subpopulation of interest from the data set, or to use an if statement to filter them out, because the excluded cases still carry information about the design that is needed for variance estimation. In Stata, subpopulation analyses for this type of data are performed using a subpop indicator.
Suppose we want to perform an analysis only for the cases where race is black in the NHANES data set. First, we must create an indicator variable that equals 1 for these cases.
. gen race_black = race == 2

. replace race_black = . if race == .
(0 real changes made)
Now we can run a simple regression model only on this subpopulation.
. svy, subpop(race_black): regress weight height i.sex
(running regress on estimation sample)

Survey: Linear regression

Number of strata = 30                            Number of obs   =      10,013
Number of PSUs   = 60                            Population size = 113,415,086
                                                 Subpop. no. obs =       1,086
                                                 Subpop. size    =  11,189,236
                                                 Design df       =          30
                                                 F(2, 29)        =       50.12
                                                 Prob > F        =      0.0000
                                                 R-squared       =      0.1131

------------------------------------------------------------------------------
             |             Linearized
      weight | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      height |    .708568   .0728382     9.73   0.000     .5598126    .8573234
             |
         sex |
     Female  |   3.508388   1.348297     2.60   0.014     .7547976    6.261979
       _cons |  -46.10337   12.56441    -3.67   0.001    -71.76331   -20.44343
------------------------------------------------------------------------------
Note: 1 stratum omitted because it contains no subpopulation members.
Compare the svy, subpop( ): results to the usual svy: regress command using an if statement:
. svy: reg weight height i.sex if race_black == 1
(running regress on estimation sample)

Survey: Linear regression

Number of strata = 30                             Number of obs   =      1,086
Number of PSUs   = 55                             Population size = 11,189,236
                                                  Design df       =         25
                                                  F(0, 25)        =          .
                                                  Prob > F        =          .
                                                  R-squared       =     0.1131

------------------------------------------------------------------------------
             |             Linearized
      weight | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      height |    .708568          .        .       .            .           .
             |
         sex |
     Female  |   3.508388          .        .       .            .           .
       _cons |  -46.10337          .        .       .            .           .
------------------------------------------------------------------------------
Note: Missing standard errors because of stratum with single sampling unit.
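The point estimates and \(R^2\) are the same, but Stata refuses to even calculate standard errors: filtering with if dropped observations (55 PSUs remain instead of 60), leaving a stratum with only a single sampling unit.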
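As a variation on the subpopulation analysis, subpop() also accepts an if condition directly, which avoids creating an indicator variable first. The following is a minimal sketch, assuming the same nhanes2 design as above; it is offered as an alternative form, not as the approach used in the original analysis.

```stata
* Reload the data and declare the design so the example is self-contained
webuse nhanes2, clear
svyset psu [pweight = finalwgt], strata(strata)

* Same subpopulation analysis, with the condition given inside subpop()
svy, subpop(if race == 2): regress weight height i.sex
```

Unlike an if qualifier placed on the regress command itself, the condition inside subpop() keeps all observations in the estimation sample, so the full design remains available for variance estimation.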