A common task is to select a random subset of rows from your data set. This document walks through an easy way to do this, including sampling a fixed number of rows within subgroups.

## Data

We’ll load up the “auto” data set and shrink it down substantially in order to be able to print out the results.

. sysuse auto
(1978 Automobile Data)

. keep make foreign

. bysort foreign: gen row = _n

. keep if row <= 4
(66 observations deleted)

. drop row

. list, sep(0)

     +--------------------------+
     | make             foreign |
     |--------------------------|
  1. | AMC Concord     Domestic |
  2. | AMC Pacer       Domestic |
  3. | AMC Spirit      Domestic |
  4. | Buick Century   Domestic |
  5. | Audi 5000        Foreign |
  6. | Audi Fox         Foreign |
  7. | BMW 320i         Foreign |
  8. | Datsun 200       Foreign |
     +--------------------------+



## Simple Random Sample

Let’s say we want to select 4 rows as a simple random sample. That is, every row has the same probability of being included in the sample.

First, we’ll generate a random number per row. You can use any continuous distribution you want; uniform or normal are common.

. generate rand = rnormal()

. list, sep(0)

     +--------------------------------------+
     | make             foreign        rand |
     |--------------------------------------|
  1. | AMC Concord     Domestic   -.4705035 |
  2. | AMC Pacer       Domestic   -.3938664 |
  3. | AMC Spirit      Domestic   -.2524172 |
  4. | Buick Century   Domestic   -1.404408 |
  5. | Audi 5000        Foreign   -.8082101 |
  6. | Audi Fox         Foreign   -.0387205 |
  7. | BMW 320i         Foreign    1.185362 |
  8. | Datsun 200       Foreign   -.2958094 |
     +--------------------------------------+



rnormal() takes two optional arguments, a mean and a standard deviation; the defaults are 0 and 1 respectively.

If you prefer uniform, call generate rand = runiform(a, b), where a and b are the lower and upper bounds respectively, e.g. generate rand = runiform(0, 1).
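Because the numbers are random, rerunning these commands produces a different ordering each time. A minimal sketch, assuming you want a reproducible draw: set the seed before generating. The seed value 12345 and the variable names rand_normal and rand_uniform are arbitrary choices for illustration, not part of the example above.

```stata
* Sketch: make the random draws reproducible. The seed value is arbitrary.
sysuse auto, clear
set seed 12345

* Standard normal draw (mean 0, standard deviation 1)
generate rand_normal = rnormal()

* Uniform draw between the lower bound 0 and the upper bound 1
generate rand_uniform = runiform(0, 1)

* Either variable can serve as the random sort key
summarize rand_normal rand_uniform
```

Running the same do-file again with the same seed should reproduce the same values, and therefore the same sample.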

Now we simply sort by this new variable.

. sort rand

. list, sep(0)

     +--------------------------------------+
     | make             foreign        rand |
     |--------------------------------------|
  1. | Buick Century   Domestic   -1.404408 |
  2. | Audi 5000        Foreign   -.8082101 |
  3. | AMC Concord     Domestic   -.4705035 |
  4. | AMC Pacer       Domestic   -.3938664 |
  5. | Datsun 200       Foreign   -.2958094 |
  6. | AMC Spirit      Domestic   -.2524172 |
  7. | Audi Fox         Foreign   -.0387205 |
  8. | BMW 320i         Foreign    1.185362 |
     +--------------------------------------+



Finally, we can identify our sample.

. gen insample = _n <= 4

. list, sep(0)

     +-------------------------------------------------+
     | make             foreign        rand   insample |
     |-------------------------------------------------|
  1. | Buick Century   Domestic   -1.404408          1 |
  2. | Audi 5000        Foreign   -.8082101          1 |
  3. | AMC Concord     Domestic   -.4705035          1 |
  4. | AMC Pacer       Domestic   -.3938664          1 |
  5. | Datsun 200       Foreign   -.2958094          0 |
  6. | AMC Spirit      Domestic   -.2524172          0 |
  7. | Audi Fox         Foreign   -.0387205          0 |
  8. | BMW 320i         Foreign    1.185362          0 |
     +-------------------------------------------------+



Recall that _n refers to the current row number, so this simply flags rows 1 through 4. Because the rows are now in random order, those four rows form a simple random sample.
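The steps in this section can be collected into a short do-file. This is a sketch under a few assumptions that are not in the walkthrough above: it runs on the full auto data set rather than the 8-row version, it uses an arbitrary seed, and it keeps the sample size in a local macro named n (a name chosen here for illustration).

```stata
* Sketch: simple random sample of n rows, flagged rather than dropped
sysuse auto, clear
keep make foreign

set seed 12345                       // arbitrary seed for reproducibility
local n = 4                          // desired sample size

generate double rand = runiform()    // one random number per row
sort rand                            // put the rows in random order
generate byte insample = _n <= `n'   // flag the first n rows

tabulate insample
```

Stata also ships a sample command, for example sample 4, count. It drops the unsampled rows instead of flagging them, so the approach above may be preferable when you need to keep the full data set.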

## Sample by Subgroup

Consider the sample we obtained above, and notice that we sampled 3 domestic cars and 1 foreign car. Since it was a simple random sample, that split is random; we could also have obtained all foreign cars or any other combination. Perhaps we want to force some balance, for example, by requiring that our random sample contain exactly 2 foreign and 2 domestic cars.

We’ll generate a new random number first just as before.

. drop rand insample

. generate rand = rnormal()

. list, sep(0)

     +--------------------------------------+
     | make             foreign        rand |
     |--------------------------------------|
  1. | Buick Century   Domestic   -1.179235 |
  2. | Audi 5000        Foreign    1.503948 |
  3. | AMC Concord     Domestic    .0767283 |
  4. | AMC Pacer       Domestic    -.627642 |
  5. | Datsun 200       Foreign   -1.122534 |
  6. | AMC Spirit      Domestic   -1.491838 |
  7. | Audi Fox         Foreign    .0291835 |
  8. | BMW 320i         Foreign   -.7714012 |
     +--------------------------------------+



Now when we sort, we’ll sort by foreign first.

. sort foreign rand

. list, sep(0)

     +--------------------------------------+
     | make             foreign        rand |
     |--------------------------------------|
  1. | AMC Spirit      Domestic   -1.491838 |
  2. | Buick Century   Domestic   -1.179235 |
  3. | AMC Pacer       Domestic    -.627642 |
  4. | AMC Concord     Domestic    .0767283 |
  5. | Datsun 200       Foreign   -1.122534 |
  6. | BMW 320i         Foreign   -.7714012 |
  7. | Audi Fox         Foreign    .0291835 |
  8. | Audi 5000        Foreign    1.503948 |
     +--------------------------------------+



So we have two separate randomly sorted lists here. To select a fixed number from each, we can use the bysort prefix.

. bysort foreign (rand): gen rownumber = _n

. gen insample = rownumber <= 2

. list, sep(0)

     +------------------------------------------------------------+
     | make             foreign        rand   rownum~r   insample |
     |------------------------------------------------------------|
  1. | AMC Spirit      Domestic   -1.491838          1          1 |
  2. | Buick Century   Domestic   -1.179235          2          1 |
  3. | AMC Pacer       Domestic    -.627642          3          0 |
  4. | AMC Concord     Domestic    .0767283          4          0 |
  5. | Datsun 200       Foreign   -1.122534          1          1 |
  6. | BMW 320i         Foreign   -.7714012          2          1 |
  7. | Audi Fox         Foreign    .0291835          3          0 |
  8. | Audi 5000        Foreign    1.503948          4          0 |
     +------------------------------------------------------------+



(Recall that when calling bysort, any variable in parentheses is used only for sorting, not for defining the by-groups. Since I had already sorted by foreign and rand above, I probably could have just used the prefix by foreign:. However, I prefer to always use bysort with the full sort order, just to avoid any issues.)
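The balanced sample can also be written as a compact do-file. This is a sketch under a few assumptions that are not in the walkthrough above: it runs on the full auto data set, uses an arbitrary seed, and stores the per-group sample size in a local macro named n_per_group (a name chosen here for illustration). It also skips the intermediate rownumber variable by comparing _n directly inside the bysort.

```stata
* Sketch: sample the same number of rows within each level of foreign
sysuse auto, clear
keep make foreign

set seed 12345                       // arbitrary seed for reproducibility
local n_per_group = 2                // rows to sample from each group

generate double rand = runiform()

* Within each foreign group, sort by rand and flag the first rows
bysort foreign (rand): generate byte insample = _n <= `n_per_group'

tabulate foreign insample
```

If dropping the unsampled rows is acceptable, the built-in sample command with its by() option, for example sample 2, count by(foreign), is another way to get the same kind of balanced draw.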

We could also have enforced an unequal split across foreign, for example 3 domestic cars and 1 foreign car:

. gen insample2 = rownumber <= 3 if foreign == 0
(4 missing values generated)

. replace insample2 = rownumber <= 1 if foreign == 1

. list, sep(0)

     +-----------------------------------------------------------------------+
     | make             foreign        rand   rownum~r   insample   insamp~2 |
     |-----------------------------------------------------------------------|
  1. | AMC Spirit      Domestic   -1.491838          1          1          1 |
  2. | Buick Century   Domestic   -1.179235          2          1          1 |
  3. | AMC Pacer       Domestic    -.627642          3          0          1 |
  4. | AMC Concord     Domestic    .0767283          4          0          0 |
  5. | Datsun 200       Foreign   -1.122534          1          1          1 |
  6. | BMW 320i         Foreign   -.7714012          2          1          0 |
  7. | Audi Fox         Foreign    .0291835          3          0          0 |
  8. | Audi 5000        Foreign    1.503948          4          0          0 |
     +-----------------------------------------------------------------------+
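The generate/replace pair used for the unequal split can be avoided by storing each group's target size in a variable and comparing against it. The following is a sketch, not the approach used above: the variable name groupsize is hypothetical, the seed is arbitrary, and the code runs on the full auto data set.

```stata
* Sketch: unequal sample sizes per group via a per-row target size
sysuse auto, clear
keep make foreign

set seed 12345                       // arbitrary seed for reproducibility
generate double rand = runiform()
bysort foreign (rand): generate rownumber = _n

* Target size: 3 for domestic (foreign == 0), 1 for foreign (foreign == 1)
generate byte groupsize = cond(foreign == 0, 3, 1)

* Flag rows whose within-group position is within the target size
generate byte insample2 = rownumber <= groupsize

tabulate foreign insample2
```

With more than two groups, the cond() call could be replaced by a merge against a small table of group sizes, which keeps the sizes in one place.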