6  Survey Data

One major strength of Stata is the ease with which it can analyze data sets arising from complex sample surveys. When working with data collected from a sample with a complex design (anything above and beyond a simple random sample of a population, where the sample design involves clustering and stratification of sampled elements, and multiple stages of sampling), standard statistical analysis procedures that assume a simple random sample (such as everything we’ve discussed so far) will result in very biased estimates of statistics that do not take the design of the sample into account. Two major problems arise when survey data is analyzed without taking the design into account:

  1. Representation
  2. Variance Estimation

Incorporation of the weights corrects for biased estimates (representation) and the stratification and clustering produces correct variance estimates.

Stata is one of the leaders in terms of statistical software that can perform these types of analyses, and offers a wide variety of commands that will perform design-based analyses of data arising from a sample with a complex design. The basic process consists of two steps (similar to mi), first using svyset to describe the complex survey design, secondly using the svy: prefix to perform analyses.

6.1 Definitions

Complex survey design is a massive topic which there are entire departments devoted to (Program at Survey Methodology here at Michigan) and which we offer a separate full day workshop (Survey Design). A simple survey design takes a random sample from the population as a whole. There are various reasons why a simple random sample will not work.

  • It is often infeasible to do either because of time or cost.
  • With smaller sample sizes, it can be difficult to obtain enough individuals in a given subpopulation.
  • For some small subpopulations, it may be very difficult to even obtain any individuals in a simple random sample.

A complex survey design allows researchers to consider these limitations and design a sampling pattern to overcome them. Three primary techniques are

  • Stratification. Rather than sample all individuals, instead target specific subpopulations and collect from them explicitly. For example, you may stratify by race and aim to collect 50 white, 50 black, 50 Hispanic, etc.
  • Clustering. Primarily a cost/time saving measure. Similar to stratification, but instead of sampling from all clusters, you take a random sample of clusters and then sample within them. A typical clustering variable is neighborhood or census tract or school.
  • Weighting. If certain sets of characteristics are more or less common, or more or less desired, when randomly sampling individuals, we can down-weight those who we don’t want/are more common, and up-weight those we want/are less common.

For example, we might want to collect data on obesity in school children in Ann Arbor. Rather than randomly sampling across all schools, we cluster by schools and randomly select 3. Then at each of those schools, we stratify by race and take a random sample of all students of each race at each school, weighted by their weight to attempt to capture more overweight students.

One final term is primary sampling unit which is the first level at which we randomized. In this example, that would be schools.

6.2 Describing the survey

The general syntax is

svyset <psu> [pweight = <weight>], strata(<strata>)

The svyset command defines the variables identifying the complex design of the sample to Stata, and only needs to be submitted once in a given Stata session. The <psu> is a variable identifying the primary sampling unit (PSU) that an observation came from. The <weight> is a variable containing sampling weights. Finally, the <strata> is a variable identifying the sampling stratum that an observation came from.

The NHANES data we’ve been using in our examples is actually from a complex sample design, which we’ve been ignoring. Let’s incorporate the sampling into the analysis.

. webuse nhanes2, clear

The three variables of interest in the data are finalwgt for the sampling weights, strata for the strata, and psu for the clusters.

. describe finalwgt strata psu

Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
finalwgt        long    %9.0g                 Sampling weight (except lead)
strata          byte    %9.0g                 Stratum identifier
psu             byte    %9.0g      psulbl     Primary sampling unit

It’s useful to know that to remove any existing survey design, you can run

. svyset, clear

Let’s set up the survey design now.

. svyset psu [pweight = finalwgt], strata(strata)

Sampling weights: finalwgt
             VCE: linearized
     Single unit: missing
        Strata 1: strata
 Sampling unit 1: psu
           FPC 1: <zero>

To get information about the strata and cluster variables use the following command or menu:

. svydescribe

Survey: Describing stage 1 sampling units

Sampling weights: finalwgt
             VCE: linearized
     Single unit: missing
        Strata 1: strata
 Sampling unit 1: psu
           FPC 1: <zero>

                                    Number of obs per unit
 Stratum   # units     # obs       Min      Mean       Max
----------------------------------------------------------
       1         2       380       165     190.0       215
       2         2       185        67      92.5       118
       3         2       348       149     174.0       199
       4         2       460       229     230.0       231
       5         2       252       105     126.0       147
       6         2       298       131     149.0       167
       7         2       476       206     238.0       270
       8         2       338       158     169.0       180
       9         2       244       100     122.0       144
      10         2       262       119     131.0       143
      11         2       275       120     137.5       155
      12         2       314       144     157.0       170
      13         2       342       154     171.0       188
      14         2       405       200     202.5       205
      15         2       380       189     190.0       191
      16         2       336       159     168.0       177
      17         2       393       180     196.5       213
      18         2       359       144     179.5       215
      20         2       285       125     142.5       160
      21         2       214       102     107.0       112
      22         2       301       128     150.5       173
      23         2       341       159     170.5       182
      24         2       438       205     219.0       233
      25         2       256       116     128.0       140
      26         2       261       129     130.5       132
      27         2       283       139     141.5       144
      28         2       299       136     149.5       163
      29         2       503       215     251.5       288
      30         2       365       166     182.5       199
      31         2       308       143     154.0       165
      32         2       450       211     225.0       239
----------------------------------------------------------
      31        62    10,351        67     167.0       288

Once the survey is defined with svyset, most common commands can be prefaced by svy: to analyze the data with the sampling structure. The svy: tab command works exactly like the tabulate command, only taking the design of the sample into account when producing estimates and chi-square statistics.

. svy: tab sex
(running tabulate on estimation sample)

Number of strata = 31                            Number of obs   =      10,351
Number of PSUs   = 62                            Population size = 117,157,513
                                                 Design df       =          31

----------------------
      Sex | proportion
----------+-----------
     Male |      .4794
   Female |      .5206
          | 
    Total |          1
----------------------
Key: proportion = Cell proportion

Next, lets look at the mean weight by gender.

. svy: mean weight, over(sex)
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31            Number of obs   =      10,351
Number of PSUs   = 62            Population size = 117,157,513
                                 Design df       =          31

--------------------------------------------------------------
             |             Linearized
             |       Mean   std. err.     [95% conf. interval]
-------------+------------------------------------------------
c.weight@sex |
       Male  |   78.62789   .2097761      78.20004    79.05573
     Female  |   65.70701    .266384      65.16372    66.25031
--------------------------------------------------------------

Compare this to the usual mean command, without the design information:

. mean weight, over(sex)

Mean estimation                         Number of obs = 10,351

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
c.weight@sex |
       Male  |   77.98423   .1945289      77.60292    78.36555
     Female  |   66.39418   .1998523      66.00243    66.78593
--------------------------------------------------------------

And compare the svy: results to the usual mean command, with only the weights considered:

. mean weight [pweight=finalwgt], over(sex)

Mean estimation                         Number of obs = 10,351

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
c.weight@sex |
       Male  |   78.62789   .2272099      78.18251    79.07326
     Female  |   65.70701   .2265547      65.26292     66.1511
--------------------------------------------------------------

We see that the weights affect on the standard error, whereas the stratification and clustering also affects the estimates.

Many of the usual commands such as regress or logit can be prefaced by svy:. If a command errors with the svy: prefix, a lot of the time the survey design will not affect it, and the documentation for the command will inform of that.

6.3 Subset analyses for complex sample survey data

In general, analysis of a particular subset of observations from a sample with a complex design should be handled very carefully. It is usually not appropriate to delete cases from the data-set that fall outside the sub-population of interest, or to use an if statement to filter them out. In Stata, sub-population analyses for this type of data are analyzed using a subpop indicator.

Suppose we want to perform an analysis only for the cases where race is black in the NHANES data set. First, we must create an indicator variable that equals 1 for these cases.

. gen race_black = race == 2

. replace race_black = . if race == .
(0 real changes made)

Now we can run a simple regression model only on

. svy, subpop(race_black): regress weight height i.sex
(running regress on estimation sample)

Survey: Linear regression

Number of strata = 30                            Number of obs   =      10,013
Number of PSUs   = 60                            Population size = 113,415,086
                                                 Subpop. no. obs =       1,086
                                                 Subpop. size    =  11,189,236
                                                 Design df       =          30
                                                 F(2, 29)        =       50.12
                                                 Prob > F        =      0.0000
                                                 R-squared       =      0.1131

------------------------------------------------------------------------------
             |             Linearized
      weight | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      height |    .708568   .0728382     9.73   0.000     .5598126    .8573234
             |
         sex |
     Female  |   3.508388   1.348297     2.60   0.014     .7547976    6.261979
       _cons |  -46.10337   12.56441    -3.67   0.001    -71.76331   -20.44343
------------------------------------------------------------------------------
Note: 1 stratum omitted because it contains no subpopulation members.

Compare the svy, subpop( ): results to the usual svy: regress command using an if statement:

. svy: reg weight height i.sex if race_black == 1
(running regress on estimation sample)

Survey: Linear regression

Number of strata = 30                             Number of obs   =      1,086
Number of PSUs   = 55                             Population size = 11,189,236
                                                  Design df       =         25
                                                  F(0, 25)        =          .
                                                  Prob > F        =          .
                                                  R-squared       =     0.1131

------------------------------------------------------------------------------
             |             Linearized
      weight | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      height |    .708568          .        .       .            .           .
             |
         sex |
     Female  |   3.508388          .        .       .            .           .
       _cons |  -46.10337          .        .       .            .           .
------------------------------------------------------------------------------
Note: Missing standard errors because of stratum with single sampling unit.

The point estimates and \(R^2\) are the same, but Stata refuses to even calculate standard errors.