# Chapter 7 Survey Data

One major strength of Stata is the ease with which it can analyze data sets arising from complex sample surveys. When working with data collected from a sample with a complex design (anything above and beyond a simple random sample of a population, where the sample design involves clustering and stratification of sampled elements, and multiple stages of sampling), standard statistical analysis procedures that assume a simple random sample (such as everything we’ve discussed so far) will result in very biased estimates of statistics that do not take the design of the sample into account. Two major problems arise when survey data is analyzed without taking the design into account:

1. Representation
2. Variance Estimation

Incorporation of the weights corrects for biased estimates (representation) and the stratification and clustering produces correct variance estimates.

Stata is one of the leaders in terms of statistical software that can perform these types of analyses, and offers a wide variety of commands that will perform design-based analyses of data arising from a sample with a complex design. The basic process consists of two steps (similar to mi), first using svyset to describe the complex survey design, secondly using the svy: prefix to perform analyses.

## 7.1 Definitions

Complex survey design is a massive topic which there are entire departments devoted to (Program at Survey Methodology here at Michigan) and which we offer a separate full day workshop (Survey Design). A simple survey design takes a random sample from the population as a whole. There are various reasons why a simple random sample will not work.

• It is often infeasible to do either because of time or cost.
• With smaller sample sizes, it can be difficult to obtain enough individuals in a given subpopulation.
• For some small subpopulations, it may be very difficult to even obtain any individuals in a simple random sample.

A complex survey design allows researchers to consider these limitations and design a sampling pattern to overcome them. Three primary techniques are

• Stratification. Rather than sample all individuals, instead target specific subpopulations and collect from them explicitly. For example, you may stratify by race and aim to collect 50 white, 50 black, 50 Hispanic, etc.
• Clustering. Primarily a cost/time saving measure. Similar to stratification, but instead of sampling from all clusters, you take a random sample of clusters and then sample within them. A typical clustering variable is neighborhood or census tract or school.
• Weighting. If certain sets of characteristics are more or less common, or more or less desired, when randomly sampling individuals, we can down-weight those who we don’t want/are more common, and up-weight those we want/are less common.

For example, we might want to collect data on obesity in school children in Ann Arbor. Rather than randomly sampling across all schools, we cluster by schools and randomly select 3. Then at each of those schools, we stratify by race and take a random sample of all students of each race at each school, weighted by their weight to attempt to capture more overweight students.

One final term is primary sampling unit which is the first level at which we randomized. In this example, that would be schools.

## 7.2 Describing the survey

The general syntax is

svyset <psu> [pweight = <weight>], strata(<strata>)


The svyset command defines the variables identifying the complex design of the sample to Stata, and only needs to be submitted once in a given Stata session. The <psu> is a variable identifying the primary sampling unit (PSU) that an observation came from. The <weight> is a variable containing sampling weights. Finally, the <strata> is a variable identifying the sampling stratum that an observation came from.

The NHANES data we’ve been using in our examples is actually from a complex sample design, which we’ve been ignoring. Let’s incorporate the sampling into the analysis.

. webuse nhanes2, clear



The three variables of interest in the data are finalwgt for the sampling weights, strata for the strata, and psu for the clusters.

. describe finalwgt strata psu

storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
finalwgt        long    %9.0g                 sampling weight (except lead)
strata          byte    %9.0g                 stratum identifier, 1-32
psu             byte    %9.0g                 primary sampling unit, 1 or 2



It’s useful to know that to remove any existing survey design, you can run

. svyset, clear



Let’s set up the survey design now.

. svyset psu [pweight = finalwgt], strata(strata)

pweight: finalwgt
VCE: linearized
Single unit: missing
Strata 1: strata
SU 1: psu
FPC 1: <zero>



To get information about the strata and cluster variables use the following command or menu:

. svydescribe

Survey: Describing stage 1 sampling units

pweight: finalwgt
VCE: linearized
Single unit: missing
Strata 1: strata
SU 1: psu
FPC 1: <zero>

#Obs per Unit
----------------------------
Stratum    #Units     #Obs      min       mean      max
--------  --------  --------  --------  --------  --------
1         2       380       165     190.0       215
2         2       185        67      92.5       118
3         2       348       149     174.0       199
4         2       460       229     230.0       231
5         2       252       105     126.0       147
6         2       298       131     149.0       167
7         2       476       206     238.0       270
8         2       338       158     169.0       180
9         2       244       100     122.0       144
10         2       262       119     131.0       143
11         2       275       120     137.5       155
12         2       314       144     157.0       170
13         2       342       154     171.0       188
14         2       405       200     202.5       205
15         2       380       189     190.0       191
16         2       336       159     168.0       177
17         2       393       180     196.5       213
18         2       359       144     179.5       215
20         2       285       125     142.5       160
21         2       214       102     107.0       112
22         2       301       128     150.5       173
23         2       341       159     170.5       182
24         2       438       205     219.0       233
25         2       256       116     128.0       140
26         2       261       129     130.5       132
27         2       283       139     141.5       144
28         2       299       136     149.5       163
29         2       503       215     251.5       288
30         2       365       166     182.5       199
31         2       308       143     154.0       165
32         2       450       211     225.0       239
--------  --------  --------  --------  --------  --------
31        62    10,351        67     167.0       288



Once the survey is defined with svyset, most common commands can be prefaced by svy: to analyze the data with the sampling structure. The svy: tab command works exactly like the tabulate command, only taking the design of the sample into account when producing estimates and chi-square statistics.

. svy: tab sex
(running tabulate on estimation sample)

Number of strata   =        31                Number of obs     =       10,351
Number of PSUs     =        62                Population size   =  117,157,513
Design df         =           31

----------------------
1=male,   |
2=female  | proportion
----------+-----------
Male |      .4794
Female |      .5206
|
Total |          1
----------------------
Key:  proportion  =  cell proportion



Next, lets look at the mean weight by gender.

. svy: mean weight, over(sex)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      31      Number of obs   =       10,351
Number of PSUs   =      62      Population size =  117,157,513
Design df       =           31

--------------------------------------------------------------
|             Linearized
|       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
c.weight@sex |
Male  |   78.62789   .2097761      78.20004    79.05573
Female  |   65.70701    .266384      65.16372    66.25031
--------------------------------------------------------------



. mean weight, over(sex)

Mean estimation                   Number of obs   =     10,351

--------------------------------------------------------------
|       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
c.weight@sex |
Male  |   77.98423   .1945289      77.60292    78.36555
Female  |   66.39418   .1998523      66.00243    66.78593
--------------------------------------------------------------



And compare the svy: results to the usual mean command, with only the weights considered:

. mean weight [pweight=finalwgt], over(sex)

Mean estimation                   Number of obs   =     10,351

--------------------------------------------------------------
|       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
c.weight@sex |
Male  |   78.62789   .2272099      78.18251    79.07326
Female  |   65.70701   .2265547      65.26292     66.1511
--------------------------------------------------------------



We see that the weights affect on the standard error, whereas the stratification and clustering also affects the estimates.

Many of the usual commands such as regress or logit can be prefaced by svy:. If a command errors with the svy: prefix, a lot of the time the survey design will not affect it, and the documentation for the command will inform of that.

## 7.3 Subset analyses for complex sample survey data

In general, analysis of a particular subset of observations from a sample with a complex design should be handled very carefully. It is usually not appropriate to delete cases from the data-set that fall outside the sub-population of interest, or to use an if statement to filter them out. In Stata, sub-population analyses for this type of data are analyzed using a subpop indicator.

Suppose we want to perform an analysis only for the cases where race is black in the NHANES data set. First, we must create an indicator variable that equals 1 for these cases.

. gen race_black = race == 2

. replace race_black = . if race == .



Now we can run a simple regression model only on

. svy, subpop(race_black): regress weight height i.sex
(running regress on estimation sample)

Survey: Linear regression

Number of strata   =        30                Number of obs     =       10,013
Number of PSUs     =        60                Population size   =  113,415,086
Subpop. no. obs   =        1,086
Subpop. size      =   11,189,236
Design df         =           30
F(   2,     29)   =        50.12
Prob > F          =       0.0000
R-squared         =       0.1131

------------------------------------------------------------------------------
|             Linearized
weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
height |    .708568   .0728382     9.73   0.000     .5598126    .8573234
|
sex |
Female  |   3.508388   1.348297     2.60   0.014     .7547976    6.261979
_cons |  -46.10337   12.56441    -3.67   0.001    -71.76331   -20.44343
------------------------------------------------------------------------------
Note: 1 stratum omitted because it contains no subpopulation members.



Compare the svy, subpop( ): results to the usual svy: regress command using an if statement:

. svy: reg weight height i.sex if race_black == 1
(running regress on estimation sample)

Survey: Linear regression

Number of strata   =        30                 Number of obs     =       1,086
Number of PSUs     =        55                 Population size   =  11,189,236
Design df         =          25
F(   0,     25)   =           .
Prob > F          =           .
R-squared         =      0.1131

------------------------------------------------------------------------------
|             Linearized
weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
height |    .708568          .        .       .            .           .
|
sex |
Female  |   3.508388          .        .       .            .           .
_cons |  -46.10337          .        .       .            .           .
------------------------------------------------------------------------------
Note: Missing standard errors because of stratum with single sampling unit.



The point estimates and $$R^2$$ are the same, but Stata refuses to even calculate standard errors.