Chapter 2 Summarizing Data
Before we consider modeling the data in any meaningful way, it is important to explore it to get a sense of its structure, as certain modeling decisions will depend on that structure. This chapter and the next will cover how to examine and visualize the data.
We will be using the built-in data set “auto” for a lot of the examples. If you’re not familiar with the sysuse command, it functions similarly to use, except it loads one of the several built-in Stata data sets.
. sysuse auto, clear
(1978 Automobile Data)
The “auto” data set contains characteristics on a number of cars from 1978.
2.1 describe, summarize, codebook
The majority of the Stata modeling commands require all variables to be numeric. String variables can be used in some places, but there are plenty of times when you might expect them to work and they don’t. As a result, I recommend converting all categorical variables into numeric. To help with this, the describe command can tell us which variables are strings and which are not.
. describe
Contains data from /Applications/Stata/ado/base/a/auto.dta
  obs:            74                          1978 Automobile Data
 vars:            12                          13 Apr 2018 17:45
                                              (_dta has notes)
--------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------
make            str18   %18s                  Make and Model
price           int     %8.0gc                Price
mpg             int     %8.0g                 Mileage (mpg)
rep78           int     %8.0g                 Repair Record 1978
headroom        float   %6.1f                 Headroom (in.)
trunk           int     %8.0g                 Trunk space (cu. ft.)
weight          int     %8.0gc                Weight (lbs.)
length          int     %8.0g                 Length (in.)
turn            int     %8.0g                 Turn Circle (ft.)
displacement    int     %8.0g                 Displacement (cu. in.)
gear_ratio      float   %6.2f                 Gear Ratio
foreign         byte    %8.0g      origin     Car type
--------------------------------------------------------------------------
Sorted by: foreign
Here we can see that “make” is a string; but make is unique per row (it identifies the make and model of each car), so it’s not something we’re going to use in the model. If you wanted to use string functions (see help string functions for details) to extract the manufacturer of each car (e.g. there are 7 Buicks in the data), the resulting “manufacturer” variable would be something we’d need to convert to numeric. The main tools you’d need are destring (which converts numeric values saved as strings into numbers) and encode (which converts strings to numerics with appropriate value labels).
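A minimal sketch of both, assuming a hypothetical string variable manufacturer (extracted as just described) and a hypothetical year stored as a string:
* encode: map string categories to 1, 2, ... and attach the strings as value labels
encode manufacturer, generate(manufacturer_num)
* destring: convert digits stored as strings (e.g. "1978") into actual numbers
destring year_string, generate(year_num)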
describe is also useful to get a sense of the size of your data.
Once we’ve taken a look at the structure of the data, we can start exploring each variable. The summarize and codebook commands contain almost the same information, presented in slightly different ways. It can be useful to look at both. For example,
. summ price, detail
                            Price
-------------------------------------------------------------
      Percentiles      Smallest
 1%         3291           3291
 5%         3748           3299
10%         3895           3667       Obs                  74
25%         4195           3748       Sum of Wgt.          74

50%       5006.5                      Mean           6165.257
                        Largest       Std. Dev.      2949.496
75%         6342          13466
90%        11385          13594       Variance        8699526
95%        13466          14500       Skewness       1.653434
99%        15906          15906       Kurtosis       4.819188
. codebook price

-------------------------------------------------------------------------------
price                                                                     Price
-------------------------------------------------------------------------------

                  type:  numeric (int)

                 range:  [3291,15906]                 units:  1
         unique values:  74                       missing .:  0/74

                  mean:   6165.26
              std. dev:    2949.5

           percentiles:        10%       25%       50%       75%       90%
                              3895      4195    5006.5      6342     11385
Things to look for here include:

- Values which are outside of expected ranges. The summarize command gives the 1st and 99th percentiles (1% and 99% of values fall below those thresholds, respectively) and codebook gives the full range. If, for example, we saw a minimum value of 203 or a maximum value of 145200 (keep in mind these are 1978 dollars!), that would be an indication of an issue with the data, likely a mistake.
- Whether the mean is as expected. If it is higher or lower than expected, that may indicate skew or the existence of outliers. If it is very close to the minimum or maximum value, perhaps you have a point mass (e.g. if you polled 18-21 year olds on their number of children, there would be a lot of 0’s but only a few non-zeros).
- If the standard deviation is very small (relative to the mean), then the variable has very consistent values. A standard deviation of 0 indicates a constant.
- codebook reports the number of missing values; if you have missing data, double check that it is not an error in the data. Perhaps multiple imputation is needed.
- If the variable is categorical (e.g. race), is the number of unique values reported in codebook as expected?
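Once you know what range of values is plausible, you can have Stata verify it; a minimal sketch, with illustrative bounds:
* assert halts with an error if any non-missing price falls outside the bounds
assert inrange(price, 1000, 20000) if !missing(price)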
2.2 mean
The mean command estimates the mean of a variable, along with its standard error and a confidence interval.
. mean price
Mean estimation                   Number of obs   =         74

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       price |   6165.257   342.8719      5481.914      6848.6
--------------------------------------------------------------
These are characteristics of the estimated mean of the “price” variable. The standard deviation reported by the summarize command above represents the variability among individual cars; the standard error reported by mean represents the variability of means: if we were to repeatedly draw samples of size 74, the standard error is a measure of the variability of the means of all those samples.
The confidence interval is interpreted as follows: if we were to continue drawing those samples of size 74, we would expect 95% of those samples to have an estimated mean within those bounds. It is not that we’re 95% confident that the true population mean falls in that range; either it does or it doesn’t!
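Like nearly all estimation commands (a concept discussed next), mean accepts the level() option if you want a different confidence level:
* request a 90% confidence interval instead of the default 95%
mean price, level(90)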
2.3 Estimation Commands
The introduction of mean allows us to discuss estimation commands. An estimation command is any command that fits a statistical model; some are obvious, such as regress for linear regression, but others, such as the mean command we just ran, are also estimation commands because they estimate confidence intervals. summarize is not, because it only provides statistics about the current sample instead of making inference about the population.
Almost all estimation commands have the same general syntax:
command varlist [if] [in] [weight] [, options]
The sections inside [ and ] are optional. The command can sometimes consist of a main command and one or more subcommands. The varlist can be empty, have a single entry, or have multiple entries (the order of which is sometimes important; generally the first is the outcome or dependent variable and the rest are predictors or independent variables).^{1}
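For example, here is a call using some of the optional pieces (runnable on the auto data):
* varlist is "price mpg"; the if qualifier restricts estimation to foreign cars
mean price mpg if foreign == 1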
Estimation commands are stored after they are run, and persist regardless of how many other non-estimation commands (such as data manipulation and postestimation commands) are run in between. As soon as another estimation command is run, the first is dropped and the new one is saved.
This allows interesting things such as replaying a command (calling the estimation command again without any varlist to redisplay its results) even if the data is gone!
. clear
. list
. mean
Mean estimation                   Number of obs   =         74

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       price |   6165.257   342.8719      5481.914      6848.6
--------------------------------------------------------------
A larger benefit of this is that if you are fitting a model on one data set and want to get predicted values on another, you could do something like the following (a sketch: regress stands in for any estimation command, and the file and variable names are hypothetical):
use fitting_data, clear
regress y x1 x2
use newdata, clear
predict fitted
2.3.1 Postestimation commands
Since the last estimation command is saved, any commands which need to reference it (called postestimation commands) do so inherently; there is no need to specify which model you mean. For example, let’s reload the data and run mean on a few variables.
. sysuse auto, clear
(1978 Automobile Data)
. mean mpg headroom length
Mean estimation                   Number of obs   =         74

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         mpg |    21.2973   .6725511       19.9569    22.63769
    headroom |   2.993243   .0983449      2.797242    3.189244
      length |   187.9324   2.588409      182.7737    193.0911
--------------------------------------------------------------
Let’s say we want to obtain the correlation matrix.^{2}
. estat vce, corr
Correlation matrix of coefficients of mean model

        e(V) |      mpg  headroom    length
-------------+------------------------------
         mpg |   1.0000
    headroom |  -0.4138    1.0000
      length |  -0.7958    0.5163    1.0000
Here we see that both length and headroom are negatively correlated with mpg; as the car gets larger, its mileage decreases. Headroom and length are positively correlated, so cars aren’t just growing in one direction!
The estat command is somewhat generic; we will see other uses of it later.
Similar to how you can get help with any command via help (e.g. help mean), you can get a list of all postestimation commands that a given estimation command supports:
help mean postestimation
There is also a link to the postestimation page in the help for the estimation command.
2.3.2 Storing and restoring estimation commands
The obvious downside to Stata’s approach of saving only the most recent estimation command is that you lose all earlier commands. If you have only a limited number of commands and each is fast, this isn’t a big deal. However, with some more advanced approaches, modeling can become very slow, so you may not want to lose the results. Stata has a solution for this, allowing us to store and recall estimation commands without having to re-run them. This has an obvious parallel to the preserve/restore commands that affect the data.
You have the choice of saving the results temporarily (in memory) or permanently (to a file). There are obvious pros and cons to each approach. For these notes I will focus primarily on storing the results in memory, but I will point out where the commands differ when saving to a file. Let’s run a fresh mean call to work with; we will use the estimates command.
. mean price mpg
Mean estimation                   Number of obs   =         74

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       price |   6165.257   342.8719      5481.914      6848.6
         mpg |    21.2973   .6725511       19.9569    22.63769
--------------------------------------------------------------
. estimates query
(active results produced by mean; not yet stored)
The query subcommand tells us what estimation command was last run, and whether it has already been stored. Here it has not. Let’s store these results.
. estimates store mean1
To save to a file, use estimates save instead. Now let’s run a second mean command.
. mean mpg headroom length
Mean estimation                   Number of obs   =         74

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         mpg |    21.2973   .6725511       19.9569    22.63769
    headroom |   2.993243   .0983449      2.797242    3.189244
      length |   187.9324   2.588409      182.7737    193.0911
--------------------------------------------------------------
. est store mean2
. est query
(active results produced by mean; also stored as mean2)
Now query tells us that the current estimation results are (obviously) stored as “mean2”. Let’s use estimates restore to jump between the two. (If saving to a file, use estimates use instead.)
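A minimal sketch of the file-based versions (the filename mymeans is illustrative):
* write the active results to mymeans.ster, then load them back later
estimates save mymeans, replace
estimates use mymeans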
. est restore mean1
(results mean1 are active now)
. estat vce, corr
Correlation matrix of coefficients of mean model

        e(V) |    price       mpg
-------------+--------------------
       price |   1.0000
         mpg |  -0.4686    1.0000
. est query
(active results produced by mean; also stored as mean1)
To “replay” an estimation command (redisplay the results without re-running the model), you can either restore it and call the bare command again:
. est restore mean2
(results mean2 are active now)
. mean
Mean estimation                   Number of obs   =         74

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         mpg |    21.2973   .6725511       19.9569    22.63769
    headroom |   2.993243   .0983449      2.797242    3.189244
      length |   187.9324   2.588409      182.7737    193.0911
--------------------------------------------------------------
or use estimates replay directly:
. est query
(active results produced by mean; also stored as mean2)
. est replay mean1

-------------------------------------------------------------
Model mean1
-------------------------------------------------------------

Mean estimation                   Number of obs   =         74

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       price |   6165.257   342.8719      5481.914      6848.6
         mpg |    21.2973   .6725511       19.9569    22.63769
--------------------------------------------------------------
One useful application of stored estimates is creating a single table that collects all the results.
. est table mean1 mean2

----------------------------------------
    Variable |    mean1        mean2
-------------+--------------------------
       price |  6165.2568
         mpg |  21.297297    21.297297
    headroom |                2.9932432
      length |                187.93243
----------------------------------------

If you are familiar with regression, you should be able to see how useful this might be!
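estimates table also accepts options to enrich the table; for example, se displays standard errors beneath each estimate and stats(N) appends the number of observations:
est table mean1 mean2, se stats(N)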
Finally, we can see all saved estimates with estimates dir, drop a specific set of results with estimates drop, or remove all of them with estimates clear:
. est dir

-----------------------------------------------------
        name | command        depvar      npar  title
-------------+---------------------------------------
       mean1 | mean           Mean           2
       mean2 | mean           Mean           3
-----------------------------------------------------

. est drop mean1
. est dir

-----------------------------------------------------
        name | command        depvar      npar  title
-------------+---------------------------------------
       mean2 | mean           Mean           3
-----------------------------------------------------

. est clear
. est dir
2.4 tab
Continuing with exploring the data: categorical variables are not summarized well by the mean. Instead, we’ll look at a tabulation.
. tabulate rep78
     Repair |
Record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.90        2.90
          2 |          8       11.59       14.49
          3 |         30       43.48       57.97
          4 |         18       26.09       84.06
          5 |         11       15.94      100.00
------------+-----------------------------------
      Total |         69      100.00
This gives us the count at each level, the percent at each level, as well as the cumulative percent (e.g. 57.97% of observations have a value of 3 or below). The cumulative percentage is only informative for an ordinal variable (a categorical variable that has an ordering to it), not for an unordered categorical variable such as race.
Note that it counts a total of 69 observations as 100% of the data. However, you may have noticed earlier that we have 74 rows of data. By default, tabulate does not include any information about missing values. The missing option corrects that.
. tab rep78, missing
     Repair |
Record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.70        2.70
          2 |          8       10.81       13.51
          3 |         30       40.54       54.05
          4 |         18       24.32       78.38
          5 |         11       14.86       93.24
          . |          5        6.76      100.00
------------+-----------------------------------
      Total |         74      100.00
It’s important to keep in mind the difference between the percentages in the two outputs. For example, 11.59% of non-missing values of rep78 are 2, whereas only 10.81% of all values are 2.
There are a few other options related to how the results are visualized which we will not cover.
2.4.1 Two-way tables
We will cover two-way tables (also known as “crosstabs”) later in univariate analysis, but there is a peculiarity of tab related to them. If you pass two variables to tab, it creates the crosstab:
. tab rep78 foreign, missing
    Repair |
    Record |       Car type
      1978 |  Domestic    Foreign |      Total
-----------+----------------------+-----------
         1 |         2          0 |          2
         2 |         8          0 |          8
         3 |        27          3 |         30
         4 |         9          9 |         18
         5 |         2          9 |         11
         . |         4          1 |          5
-----------+----------------------+-----------
     Total |        52         22 |         74
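If raw counts are hard to compare, tabulate can add percentages within each row (or column) of the crosstab via the row (or col) option:
* each cell additionally reports its percentage of the row total
tab rep78 foreign, row missing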
2.4.2 Generating dummy variables
Although Stata has excellent categorical variable handling capabilities, you may occasionally have the situation where you want the dummy variables instead of a category. For an example of the difference, consider a “campus” variable with three options, “central”, “north” and “medical”. Imagine our data looks like:
 id | campus  | campuscentral | campusnorth | campusmedical
----+---------+---------------+-------------+--------------
  1 | north   |             0 |           1 |             0
  2 | central |             1 |           0 |             0
  3 | north   |             0 |           1 |             0
  4 | north   |             0 |           1 |             0
  5 | medical |             0 |           0 |             1
Notice that the information in campus and the information encoded in campuscentral, campusnorth, and campusmedical is identical. A 1 in the campus____ variables represents “True” and a 0 represents “False”, and only a single 1 is allowed per row.
As mentioned, we will most of the time use categorical variables such as campus over dummy variables like campus____ (the dummies are what enter the actual model, but Stata creates them for you behind the scenes so you don’t need to worry about them). If necessary, though, you can create the dummy variables using tab:
. list rep* in 1/5
     +-------+
     | rep78 |
     |-------|
  1. |     3 |
  2. |     3 |
  3. |     . |
  4. |     3 |
  5. |     4 |
     +-------+
. tab rep78, gen(reps)
     Repair |
Record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.90        2.90
          2 |          8       11.59       14.49
          3 |         30       43.48       57.97
          4 |         18       26.09       84.06
          5 |         11       15.94      100.00
------------+-----------------------------------
      Total |         69      100.00
. list rep* in 1/5
     +-----------------------------------------------+
     | rep78   reps1   reps2   reps3   reps4   reps5 |
     |-----------------------------------------------|
  1. |     3       0       0       1       0       0 |
  2. |     3       0       0       1       0       0 |
  3. |     .       .       .       .       .       . |
  4. |     3       0       0       1       0       0 |
  5. |     4       0       0       0       1       0 |
     +-----------------------------------------------+
If you are not familiar with the list command, it prints out data. Giving it a variable (or several) restricts the printout to those variables (here we restricted it to rep*, which matches any variable that starts with “rep”; the * is a wildcard), and the in statement restricts it to the first 5 observations (we just want a quick look, not to print everything).
Take note of how the missing value is propagated when creating the dummies.
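If you need only a single dummy rather than the full set, generate can create it directly; the if qualifier below preserves missing values the same way tab, gen() did (the new variable name is arbitrary):
* 1 if rep78 is 3, 0 for any other non-missing value, missing otherwise
gen rep78_3 = rep78 == 3 if !missing(rep78)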
2.5 correlate
With the use of tab for crosstabs, we’ve left univariate summaries and moved to joint summaries. For continuous variables, we can use the correlation to examine how similar two continuous variables are. The most common version of correlation is Pearson’s correlation, which ranges from -1 to 1.
- A value of 0 represents no correlation; having information about one variable provides no insight into the other.
- A value towards 1 represents positive correlation; as one value increases, the other tends to increase as well. A correlation of 1 would be perfect correlation: the two variables differ by only a transformation, e.g. your height in inches and centimeters.
- A value towards -1 represents negative correlation; as one value increases, the other tends to decrease (and vice-versa). A correlation of -1 would be perfect negative correlation, e.g. during a road trip, your distance traveled is perfectly negatively correlated with your distance to your destination (ignoring detours).
We can calculate Pearson’s correlation with correlate.
. correlate weight length
(obs=74)
             |   weight   length
-------------+-------------------
      weight |   1.0000
      length |   0.9460   1.0000
This produces what’s known as the correlation matrix. The diagonal entries are both 1, because clearly each variable is perfectly correlated with itself! The off-diagonal entries are identical, since correlation is a symmetric operation. The value of .9460 is extremely close to one, as we would expect: longer cars are heavier, and vice-versa. Another way to think of it is that once we know weight, learning length does not add much information. On the other hand,
. corr price turn
(obs=74)
             |    price     turn
-------------+-------------------
       price |   1.0000
        turn |   0.3096   1.0000
with a correlation of .31, learning turn when you already know price does add a lot of information.
We can look at multiple correlations at once as well.
. corr mpg weight length
(obs=74)
             |      mpg   weight   length
-------------+-----------------------------
         mpg |   1.0000
      weight |  -0.8072   1.0000
      length |  -0.7958   0.9460   1.0000
We see the .9460 from earlier, but notice also that mpg is negatively correlated with both weight and length: a larger car gets worse mileage, and low-mileage cars tend to be large. A few notes:
- The amount of information contained is irrespective of the sign; once you know the mpg of a car, adding information about its weight doesn’t add much.
- The two correlations with mpg are extremely similar. We might generally expect that, given that weight and length are so strongly correlated. Note that although we expect it, it is not a rule; it is entirely possible (though unlikely) that the correlations with mpg could be very dissimilar.
What thresholds define a “low” or “high” correlation? This depends greatly on your field and setting, but a common rule of thumb is that .3 is a low correlation, .6 is a moderate correlation, and .8 is a strong correlation.
It is possible to obtain p-values testing whether each of those correlations is significantly distinguishable from 0 by passing the sig option (an option of the pwcorr command discussed below). However, p-values for correlations are extremely dependent on sample size and should not be trusted.^{3} The magnitude of the correlation is much more important than its significance.
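For example, using pwcorr (covered in more detail below):
* sig adds a p-value beneath each correlation
pwcorr price turn, sig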
2.5.1 varlists in Stata
Consider if we wanted to look at all the continuous variables in the data. We could write corr price mpg … and make a very long command. The collection of all those variables would be a “varlist”. Stata has several ways of shortcutting this.
The first we’ve already seen when we used the wildcard * above. We can use * anywhere in a variable name to denote any number of additional characters (including none). E.g. “this*var” matches “thisvar”, “thisnewvar”, “thisHFJHDJSHFKDHFKSHvar”, etc. A second wildcard, “?”, represents exactly one character, so “this?var” would match “thisAvar” or “thisJvar”, but none of the names above.
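For example, both wildcards work with describe (using the reps dummies created earlier):
* "*" matches any number of characters: rep78 and reps1 through reps5
describe rep*
* "?" matches exactly one character: reps1 through reps5, but not rep78
describe reps?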
Secondly, we can match a subset of variables that are adjacent to each other in the data using “-”. All variables, starting with the one to the left of the - and ending with the one to the right of the -, are included. For example,
. desc, simple
make          headroom      turn          reps1         reps5
price         trunk         displacement  reps2
mpg           weight        gear_ratio    reps3
rep78         length        foreign       reps4
. desc trunk-turn

              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------
trunk           int     %8.0g                 Trunk space (cu. ft.)
weight          int     %8.0gc                Weight (lbs.)
length          int     %8.0g                 Length (in.)
turn            int     %8.0g                 Turn Circle (ft.)
We can combine those two, as well as specifying individual variables.
. corr price-rep78 t* displacement
(obs=69)

             |    price      mpg    rep78    trunk     turn displa~t
-------------+-------------------------------------------------------
       price |   1.0000
         mpg |  -0.4559   1.0000
       rep78 |   0.0066   0.4023   1.0000
       trunk |   0.3232  -0.5798  -0.1572   1.0000
        turn |   0.3302  -0.7355  -0.4961   0.6008   1.0000
displacement |   0.5479  -0.7434  -0.4119   0.6287   0.8124   1.0000
price, mpg, and rep78 are included as part of price-rep78; t* matches trunk and turn; and displacement is included by itself.
Finally, there is the special varlist _all, which is shorthand for all variables (e.g. firstvar-lastvar). It is accepted in most but not all places that take in variables.
. corr _all
(make ignored because string variable)
(obs=69)
             |    price      mpg    rep78 headroom    trunk   weight   length
-------------+-----------------------------------------------------------------
       price |   1.0000
         mpg |  -0.4559   1.0000
       rep78 |   0.0066   0.4023   1.0000
    headroom |   0.1112  -0.3996  -0.1480   1.0000
       trunk |   0.3232  -0.5798  -0.1572   0.6608   1.0000
      weight |   0.5478  -0.8055  -0.4003   0.4795   0.6691   1.0000
      length |   0.4425  -0.8037  -0.3606   0.5240   0.7326   0.9478   1.0000
        turn |   0.3302  -0.7355  -0.4961   0.4347   0.6008   0.8610   0.8631
displacement |   0.5479  -0.7434  -0.4119   0.4763   0.6287   0.9316   0.8621
  gear_ratio |  -0.3802   0.6565   0.4103  -0.3790  -0.5107  -0.7906  -0.7232
     foreign |   0.0174   0.4538   0.5922  -0.3347  -0.4053  -0.6460  -0.6110
       reps1 |   0.0945   0.0086  -0.4230   0.2550   0.2175   0.0149   0.0054
       reps2 |   0.0223  -0.1346  -0.5180   0.1603   0.0586   0.1480   0.1778
       reps3 |   0.0859  -0.2796  -0.3622   0.1726   0.2724   0.2975   0.2218
       reps4 |   0.0153   0.0384   0.3592   0.0195   0.0589  -0.1223  -0.0909
       reps5 |  -0.0351   0.4542   0.7065  -0.2337  -0.2498  -0.3925  -0.3492

             |     turn displa~t gear_r~o  foreign    reps1    reps2    reps3
-------------+-----------------------------------------------------------------
        turn |   1.0000
displacement |   0.8124   1.0000
  gear_ratio |  -0.7005  -0.8381   1.0000
     foreign |  -0.6768  -0.6383   0.7266   1.0000
       reps1 |   0.0471   0.0131   0.0355  -0.1143   1.0000
       reps2 |   0.2939   0.1733  -0.2468  -0.2395  -0.0626   1.0000
       reps3 |   0.2526   0.3038  -0.2449  -0.3895  -0.1515  -0.3176   1.0000
       reps4 |  -0.1748  -0.1231   0.2287   0.2526  -0.1026  -0.2151  -0.5211
       reps5 |  -0.4110  -0.4093   0.2894   0.4863  -0.0752  -0.1577  -0.3820

             |    reps4    reps5
-------------+-------------------
       reps4 |   1.0000
       reps5 |  -0.2587   1.0000
Notice that it automatically ignored the string variable make. Not all commands handle strings this gracefully, so _all may occasionally fail unexpectedly.
2.5.2 Pairwise completion vs complete cases
You may have noticed that the corr command reports the number of observations it used; for example, the first few correlations all used 74 observations, but the _all version used only 69. correlate uses what’s known as complete cases analysis: any observation missing any value used in the command is excluded. rep78 is missing 5 observations (run the misstable summarize command to see this).
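For instance (the varlist is optional; without it, misstable summarize reports every variable containing missing values):
misstable summarize rep78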
On the other hand, pairwise completion only excludes missing values from the relevant comparisons. If a given correlation doesn’t involve rep78, it will use all the data. We can obtain this with pwcorr.
. corr rep78 price trunk
(obs=69)

             |    rep78    price    trunk
-------------+-----------------------------
       rep78 |   1.0000
       price |   0.0066   1.0000
       trunk |  -0.1572   0.3232   1.0000

. pwcorr rep78 price trunk

             |    rep78    price    trunk
-------------+-----------------------------
       rep78 |   1.0000
       price |   0.0066   1.0000
       trunk |  -0.1572   0.3143   1.0000
Notice that the two correlations involving rep78 are identical; the same set of observations is dropped in both. However, the correlation between price and trunk differs: in correlate, it is estimated using only 69 observations, whereas in pwcorr it uses all 74.
It may seem that pwcorr is always superior (and, in isolation, it is). However, most models, such as regression, support only complete cases analysis, so if you are exploring your data in preparation for such a model, pairwise comparisons may not make sense. Ultimately, the choice remains up to you. If the results from correlate and pwcorr do differ drastically, that is a sign of something else going on!
2.5.3 Spearman correlation
One limitation of Pearson’s correlation is that it detects only linear relationships. A famous example of this is Anscombe’s quartet: four small data sets whose scatterplots look wildly different yet share nearly identical summary statistics.
In each pair, the Pearson correlation is an identical .8162! In the first, that’s what we want. In the second, the relationship is strong but nonlinear. In the third, only one value is not perfectly correlated, yet that single point diminishes the Pearson correlation. In the fourth, only the existence of the single outlier drives the relationship.
Spearman correlation is an alternative to Pearson correlation. It works by ranking each variable and then computing Pearson’s correlation on the ranks. The command in Stata is spearman.
. corr price trunk
(obs=74)
             |    price    trunk
-------------+-------------------
       price |   1.0000
       trunk |   0.3143   1.0000
. spearman price trunk, matrix
(obs=74)
             |    price    trunk
-------------+-------------------
       price |   1.0000
       trunk |   0.3996   1.0000
The matrix option forces the output to mirror correlate; otherwise spearman produces a slightly different output when given only two variables. spearman uses complete cases; to use pairwise completion instead, pass the pw option:
. spearman mpg-headroom, pw
(obs=varies)
             |      mpg    rep78 headroom
-------------+-----------------------------
         mpg |   1.0000
       rep78 |   0.3098   1.0000
    headroom |  -0.4866  -0.1583   1.0000
How does Spearman’s correlation compare to Pearson’s for Anscombe’s quartet?
Comparison       | Pearson | Spearman
-----------------+---------+---------
\(y_1, x_1\)     |  .8162  |  .8182
\(y_2, x_2\)     |  .8162  |  .6909
\(y_3, x_3\)     |  .8162  |  .9909
\(y_4, x_4\)     |  .8162  |  .5000
The second correlation diminishes, the third drastically increases, and the fourth decreases as well.
2.6 Exercise 1
For these exercises, we’ll be using data from NHANES, the National Health And Nutrition Examination Survey. The data is on Stata’s website, and you can load it via
webuse nhanes2, clear

1. Use describe to get a sense of the data. How many observations? How many variables?
2. Use tab, summarize, codebook, and/or mean to get an understanding of some of the variables that we’ll be using going forward: region, sex, diabetes, and lead.
3. Does race have any missing data? Does diabetes? Does lead?
4. What is more highly correlated? A person’s height and weight, or their diastolic and systolic blood pressure?