Chapter 7 Survey Data
One major strength of Stata is the ease with which it can analyze data sets arising from complex sample surveys. When working with data collected from a sample with a complex design (anything beyond a simple random sample of a population, such as a design that involves clustering and stratification of sampled elements or multiple stages of sampling), standard statistical analysis procedures that assume a simple random sample (such as everything we’ve discussed so far) do not take the design of the sample into account and can produce badly biased estimates. Two major problems arise when survey data are analyzed without taking the design into account:
 Representation
 Variance Estimation
Incorporating the sampling weights corrects the biased point estimates (representation), and accounting for the stratification and clustering produces correct variance estimates.
Stata is one of the leading statistical packages for these types of analyses, and offers a wide variety of commands that perform design-based analyses of data arising from a sample with a complex design. The basic process consists of two steps (similar to mi): first, use svyset to describe the complex survey design; second, use the svy: prefix to perform analyses.
7.1 Definitions
Complex survey design is a massive topic to which entire departments are devoted (such as the Program in Survey Methodology here at Michigan) and on which we offer a separate full-day workshop (Survey Design). A simple survey design takes a random sample from the population as a whole. There are various reasons why a simple random sample may not work:
 It is often infeasible because of time or cost.
 With smaller sample sizes, it can be difficult to obtain enough individuals in a given subpopulation.
 For some small subpopulations, it may be very difficult to even obtain any individuals in a simple random sample.
A complex survey design allows researchers to consider these limitations and design a sampling pattern to overcome them. Three primary techniques are:
 Stratification. Rather than sampling from the population as a whole, divide it into subpopulations (strata) and sample from each of them explicitly. For example, you may stratify by race and aim to collect 50 white, 50 black, 50 Hispanic, etc.
 Clustering. Primarily a cost/time-saving measure. As with stratification, the population is divided into groups, but instead of sampling from every group, you take a random sample of clusters and then sample within them. Typical clustering variables are neighborhood, census tract, or school.
 Weighting. If certain sets of characteristics are more or less common, or more or less desired, we can adjust the chance of selection when randomly sampling individuals: downweight those we don’t want or who are more common, and upweight those we want or who are less common.
For example, we might want to collect data on obesity in school children in Ann Arbor. Rather than randomly sampling across all schools, we cluster by school and randomly select 3 schools. Then at each of those schools, we stratify by race and take a random sample of students of each race, with selection weighted by body weight to attempt to capture more overweight students.
One final term is the primary sampling unit (PSU), which is the first level at which we sampled randomly. In this example, that would be schools.
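To make the weighting idea concrete, here is a minimal sketch with made-up numbers. It relies on the standard convention that the sampling weight used at analysis time is the inverse of an individual's probability of selection. The group names, population sizes, and sample sizes are hypothetical and are not part of the Ann Arbor example.

```stata
* Toy illustration (hypothetical numbers): two groups sampled at different rates
clear
input str6 group pop_size n_sampled
"common" 9000 50
"rare"   1000 50
end

* Probability that any one member of the group is selected
generate double p_select = n_sampled / pop_size

* Sampling weight = inverse of the selection probability, i.e. the
* number of population members each sampled person represents
generate double samp_wt = 1 / p_select

list, noobs
```

Each sampled member of the common group stands in for 9000/50 = 180 people, while each sampled member of the oversampled rare group stands in for only 1000/50 = 20.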
7.2 Describing the survey
The general syntax is
svyset <psu> [pweight = <weight>], strata(<strata>)
The svyset command defines the variables identifying the complex design of the sample to Stata, and only needs to be submitted once in a given Stata session. The <psu> is a variable identifying the primary sampling unit (PSU) that an observation came from. The <weight> is a variable containing sampling weights. Finally, the <strata> is a variable identifying the sampling stratum that an observation came from.
The NHANES data we’ve been using in our examples is actually from a complex sample design, which we’ve been ignoring. Let’s incorporate the sampling into the analysis.
. webuse nhanes2, clear
The three variables of interest in the data are finalwgt for the sampling weights, strata for the strata, and psu for the clusters.
. describe finalwgt strata psu
                storage   display    value
variable name   type      format     label      variable label
----------------------------------------------------------------------
finalwgt        long      %9.0g                 sampling weight (except lead)
strata          byte      %9.0g                 stratum identifier, 1-32
psu             byte      %9.0g                 primary sampling unit, 1 or 2
It’s useful to know that to remove any existing survey design, you can run
. svyset, clear
Let’s set up the survey design now.
. svyset psu [pweight = finalwgt], strata(strata)
pweight: finalwgt
VCE: linearized
Single unit: missing
Strata 1: strata
SU 1: psu
FPC 1: <zero>
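If you are unsure whether a design has already been declared, a short check like the following may help. It is a sketch that repeats the setup above and relies on svyset, typed with no arguments, reporting the design currently in effect.

```stata
* Declare the design, then ask Stata to report what is currently set
webuse nhanes2, clear
svyset psu [pweight = finalwgt], strata(strata)

* With no arguments, svyset redisplays the current survey settings
svyset
```

As far as I am aware, the settings are also stored with the dataset when it is saved, so the no-argument form is a quick check after reopening a file.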
To get information about the strata and cluster variables use the following command or menu:
. svydescribe
Survey: Describing stage 1 sampling units
pweight: finalwgt
VCE: linearized
Single unit: missing
Strata 1: strata
SU 1: psu
FPC 1: <zero>
                                  #Obs per Unit
                            --------------------------
 Stratum   #Units     #Obs      min      mean      max
--------  -------  -------  -------  --------  -------
       1        2      380      165     190.0      215
       2        2      185       67      92.5      118
       3        2      348      149     174.0      199
       4        2      460      229     230.0      231
       5        2      252      105     126.0      147
       6        2      298      131     149.0      167
       7        2      476      206     238.0      270
       8        2      338      158     169.0      180
       9        2      244      100     122.0      144
      10        2      262      119     131.0      143
      11        2      275      120     137.5      155
      12        2      314      144     157.0      170
      13        2      342      154     171.0      188
      14        2      405      200     202.5      205
      15        2      380      189     190.0      191
      16        2      336      159     168.0      177
      17        2      393      180     196.5      213
      18        2      359      144     179.5      215
      20        2      285      125     142.5      160
      21        2      214      102     107.0      112
      22        2      301      128     150.5      173
      23        2      341      159     170.5      182
      24        2      438      205     219.0      233
      25        2      256      116     128.0      140
      26        2      261      129     130.5      132
      27        2      283      139     141.5      144
      28        2      299      136     149.5      163
      29        2      503      215     251.5      288
      30        2      365      166     182.5      199
      31        2      308      143     154.0      165
      32        2      450      211     225.0      239
--------  -------  -------  -------  --------  -------
      31       62   10,351       67     167.0      288
Once the survey is defined with svyset, most common commands can be prefaced by svy: to analyze the data with the sampling structure. The svy: tab command works exactly like the tabulate command, except that it takes the design of the sample into account when producing estimates and chi-square statistics.
. svy: tab sex
(running tabulate on estimation sample)
Number of strata =  31                  Number of obs   =      10,351
Number of PSUs   =  62                  Population size = 117,157,513
                                        Design df       =          31

----------------------
1=male,   |
2=female  | proportion
----------+-----------
     Male |      .4794
   Female |      .5206
          |
    Total |          1
----------------------
Key: proportion = cell proportion
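The design-based chi-square statistics only appear for two-way tables. A minimal sketch, assuming the same nhanes2 setup and using its race variable as the second dimension, could look like this:

```stata
webuse nhanes2, clear
svyset psu [pweight = finalwgt], strata(strata)

* Two-way table with row proportions; the output also reports a
* design-based test of independence between sex and race
svy: tabulate sex race, row
```

The row option is one choice among several (column and cell proportions are requested the same way). Which one is appropriate depends on the question being asked.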
Next, let’s look at the mean weight by sex.
. svy: mean weight, over(sex)
(running mean on estimation sample)
Survey: Mean estimation
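Number of strata =  31                  Number of obs   =      10,351
Number of PSUs   =  62                  Population size = 117,157,513
                                        Design df       =          31

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
c.weight@sex |
       Male  |   78.62789   .2097761      78.20004    79.05573
     Female  |   65.70701    .266384      65.16372    66.25031
--------------------------------------------------------------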
Compare this to the usual mean command, without the design information:
. mean weight, over(sex)
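Mean estimation                         Number of obs   =      10,351

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
c.weight@sex |
       Male  |   77.98423   .1945289      77.60292    78.36555
     Female  |   66.39418   .1998523      66.00243    66.78593
--------------------------------------------------------------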
And compare the svy: results to the usual mean command, with only the weights considered:
. mean weight [pweight=finalwgt], over(sex)
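Mean estimation                         Number of obs   =      10,351

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
c.weight@sex |
       Male  |   78.62789   .2272099      78.18251    79.07326
     Female  |   65.70701   .2265547      65.26292     66.1511
--------------------------------------------------------------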
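We see that the weights change the point estimates (for males, 78.63 with weights versus 77.98 without), whereas the stratification and clustering leave the estimates unchanged and affect only the standard errors (for males, .2098 with the full design versus .2272 with weights alone).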
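The three sets of results can also be lined up in a single table, which makes it easier to see which parts of the design move the means and which move the standard errors. This is a sketch using estimates store and estimates table; the stored names (unweighted, weights_only, full_design) are arbitrary labels chosen for this illustration.

```stata
webuse nhanes2, clear
svyset psu [pweight = finalwgt], strata(strata)

* 1. Ignore the design entirely
quietly mean weight, over(sex)
estimates store unweighted

* 2. Use the sampling weights only
quietly mean weight [pweight = finalwgt], over(sex)
estimates store weights_only

* 3. Use weights, strata, and PSUs
quietly svy: mean weight, over(sex)
estimates store full_design

* Means with standard errors beneath them, one column per approach
estimates table unweighted weights_only full_design, b(%9.4f) se(%9.4f)
```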
Many of the usual commands, such as regress or logit, can be prefaced by svy:. If a command errors with the svy: prefix, it is often because the survey design does not affect that command, and the documentation for the command will say so.
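As a brief sketch of what this looks like in practice, the following fits a linear and a logistic model with the design taken into account. The choice of outcomes and predictors is purely illustrative; diabetes and age are assumed here to be the 0/1 diabetes indicator and age variable in nhanes2.

```stata
webuse nhanes2, clear
svyset psu [pweight = finalwgt], strata(strata)

* Design-based linear regression
svy: regress weight height i.sex

* Design-based logistic regression (diabetes assumed to be coded 0/1)
svy: logit diabetes age i.sex
```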
7.3 Subset analyses for complex sample survey data
In general, analysis of a particular subset of observations from a sample with a complex design should be handled very carefully. It is usually not appropriate to delete cases from the dataset that fall outside the subpopulation of interest, or to use an if statement to filter them out, because the dropped cases still carry information about the strata and PSUs that is needed to estimate variances correctly. In Stata, subpopulation analyses for this type of data are carried out using the subpop() option with an indicator variable.
Suppose we want to perform an analysis only for the cases where race is black in the NHANES data set. First, we must create an indicator variable that equals 1 for these cases.
. gen race_black = race == 2
. replace race_black = . if race == .
(0 real changes made)
Now we can run a simple regression model only on this subpopulation:
. svy, subpop(race_black): regress weight height i.sex
(running regress on estimation sample)
Survey: Linear regression
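Number of strata =  30                  Number of obs   =      10,013
Number of PSUs   =  60                  Population size = 113,415,086
                                        Subpop. no. obs =       1,086
                                        Subpop. size    =  11,189,236
                                        Design df       =          30
                                        F(2, 29)        =       50.12
                                        Prob > F        =      0.0000
                                        R-squared       =      0.1131

------------------------------------------------------------------------------
             |             Linearized
      weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |    .708568   .0728382     9.73   0.000     .5598126    .8573234
             |
         sex |
     Female  |   3.508388   1.348297     2.60   0.014     .7547976    6.261979
       _cons |  -46.10337   12.56441    -3.67   0.001    -71.76331   -20.44343
------------------------------------------------------------------------------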
Note: 1 stratum omitted because it contains no subpopulation members.
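The subpop() option also accepts an if expression directly, which avoids creating a separate indicator variable. The sketch below assumes, as in the example above, that race == 2 codes black respondents.

```stata
webuse nhanes2, clear
svyset psu [pweight = finalwgt], strata(strata)

* Same subpopulation, defined inline rather than through an indicator
svy, subpop(if race == 2): regress weight height i.sex
```

One difference to keep in mind: with the inline form, an observation with missing race simply counts as outside the subpopulation, whereas a missing value in an indicator variable drops the observation from the estimation sample. Here race has no missing values, so the two forms should agree.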
Compare the svy, subpop(): results to the usual svy: regress command using an if statement:
. svy: reg weight height i.sex if race_black == 1
(running regress on estimation sample)
Survey: Linear regression
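Number of strata =  30                  Number of obs   =       1,086
Number of PSUs   =  55                  Population size =  11,189,236
                                        Design df       =          25
                                        F(0, 25)        =           .
                                        Prob > F        =           .
                                        R-squared       =      0.1131

------------------------------------------------------------------------------
             |             Linearized
      weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |    .708568          .        .       .            .           .
             |
         sex |
     Female  |   3.508388          .        .       .            .           .
       _cons |  -46.10337          .        .       .            .           .
------------------------------------------------------------------------------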
Note: Missing standard errors because of stratum with single sampling unit.
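The point estimates and \(R^2\) are the same, but Stata cannot calculate standard errors. The if statement removes the other observations from the design entirely, which leaves some strata with only a single PSU (55 PSUs across 30 strata), and the variance within a stratum cannot be estimated from one PSU.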
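One way to see where the missing standard errors come from is to ask svydescribe which strata are left with a single PSU once the sample is restricted. This is a sketch that assumes svydescribe's single option and if qualifier behave as documented; it recreates the race_black indicator so that it can be run on its own.

```stata
webuse nhanes2, clear
svyset psu [pweight = finalwgt], strata(strata)

* Indicator for the subpopulation, left missing where race is missing
generate race_black = race == 2 if !missing(race)

* Show only the strata that contain a single sampling unit
* after restricting to the subpopulation with if
svydescribe if race_black == 1, single
```

With the full sample every stratum has two PSUs, so the strata listed here are a consequence of the if restriction.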