Modifying a Built-in Stata Command
I recently came across an issue in which modifying a shipped-with-Stata command was the easier way to address. I’m writing this to document the approach I took in case it comes up again and proves useful to remember the steps I took.
(Note: If you see ...
in Stata code snippets below, it
indicates portions of code I excluded for brevity.)
The Problem
A client was working with survival data and using the sts
test
command to show that the number of observed and expected
events was similar. The sts test
command produces a
p-value, as shown below, but the journal the client was submitting to
rightly prefers confidence intervals over p-values. So the client was
hoping to bootstrap sts test
to generate confidence
intervals for the expected number of events. The problem was
that sts test
returns very little information and thus
bootstrap wouldn’t work as normal (as it requires return
’d
or ereturn
’d values to operate).
. webuse stan3 (Heart transplant data) . sts test posttran Failure _d: died Analysis time _t: t1 ID variable: id Equality of survivor functions Log-rank test | Observed Expected posttran | events events ---------+------------------------- 0 | 30 31.20 1 | 45 43.80 ---------+------------------------- Total | 75 75.00 chi2(1) = 0.13 Pr>chi2 = 0.7225 . return list scalars: r(chi2) = .1261354821436887 r(df) = 1
While I could probably have figured out the math that Stata uses to
calculate those expected events, it seemed probably easier and
definitely more fun to hack the sts test
command.
Finding Where to Modify
The which
command will tell you whether a command exists in
an ado file, or is “built-in”, which cannot be modified as easily. Since
the “test” of sts test
is a subcommand, we can find
where sts
is defined.
. which sts /Applications/Stata/ado/base/s/sts.ado *! version 8.8.0 23apr2022
Opening that file, I quickly found the test
subcommand:
program define Test, rclass ...
Searching the file for “Expected” or “events” or any other text in
the sts test
output came up blank, so the actual
calcuations must take place elsewhere.
The sts test
command runs a variety of different tests, and
I noticed that a good chunk of the code was dedicated to determining
which test the user request, and setting the cmd
macro.
... else if "`tware'"~="" { local cmd "tware_st" } ... else if "`peto'"~="" { local cmd "peto_st" } ... else local cmd "logrank" ...
Finally, near the bottom of the command, the cmd
macro is
used:
... `vv' `cmd' _t _d `w' if `touse', strata(`strata') /* */ t0(_t0) `id' `by' `options' `detail' `trend' `p' `q' ...
Ignoring the vv
macro (which handles version
if applicable), it’s clear that the cmd
macro must be an
actual command. And the particular test the client was working with was
the logrank test, therefore I was looking for the logrank
command.
. which logrank /Applications/Stata/ado/base/l/logrank.ado *! version 7.1.17 10nov2021
Inside I found the following code:
... di in smcl in gr _n _col(`len') `" {c |} Observed Expected"' local pad = `len' - `len1' if `"`strata'"'==`""' { local dup `" events"' } else local dup `" events*"' di in smcl in gr `"`ttl'"' _skip(`pad') `"{c |} events `dup'"' di in smcl in gr "{hline `len'}{c +}{hline 25}" local sum 0 local i 1 local gstr = (bsubstr("`:type `grp''", 1, 3)=="str") while `i' <= _N { if (`gstr') { local x : di udsubstr(`grp'[`i'], 1, 255) } else { local x = `grp'[`i'] } local pad = `len' - udstrlen(`"`x'"')-1 di in smcl in gr _skip(`pad') `"`x' "' "{c |}" in ye /* */ %10.0g `wo'[1,`i'] `" "' %10.2f `w'[1,`i'] local sum = `sum' + `wo'[1,`i'] local i = `i' + 1 } di in smcl in gr "{hline `len'}{c +}{hline 25}" local pad = `len' - 6 di in smcl in gr _skip(`pad') `"Total "' `"{c |}"' in ye /* */ %10.0g `sum' `" "' %10.2f `sum' ...
This is very confusing code to look at at first (for some reason all
shipped-with-Stata code uses the shortest possible versions of the
command names, making things even more obtuse) but we can tell that this
is producing the table output we’re looking for. di
is display
, so each di
line is printing
something out. The first couple are printing “Observed”, “Expected”, and
“events”, so that’s the head of the table, and the last di
is printing “Total” which is the end of the table. The sts
test
(and logrank
) command takes in a categorical
variable and prints a row per level, so the while
loop must
be going through each level of the variable. Inside the loop, there is
only a single di
statement:
... di in smcl in gr _skip(`pad') `"`x' "' "{c |}" in ye /* */ %10.0g `wo'[1,`i'] `" "' %10.2f `w'[1,`i'] ...
Note the reference to two matrix extractions: wo[1, i]
and w[1, i]
. These insert the observed (wo
)
and expected (w
) number of events!
Making the Modification
So ultimately, we just need to return the w
matrix. Finding
the other returns,
... ret scalar df = colsof(`w') - 1 ret scalar chi2 = `V'[1,1] ...
we can add our own return that stores each expected value into a scalar:
matrix events=`w' local i 1 while `i' <= _N { return scalar e`i' = events[1,`i'] local i = `i' + 1 }
Now we can save this, re-open Stata, and it works!
. webuse stan3 (Heart transplant data) . bootstrap e1=r(e1) e2=r(e2), reps(10): sts test posttran (running sts on estimation sample) warning: sts does not set e(sample), so no observations will be excluded from the resampling because of missing values or other reasons. To exclude observations, press Break, save the data, drop any observations that are to be excluded, and rerun bootstrap. Bootstrap replications (10): .........10 done Bootstrap results Number of obs = 172 Replications = 10 Command: sts test posttran e1: r(e1) e2: r(e2) ------------------------------------------------------------------------------ | Observed Bootstrap Normal-based | coefficient std. err. z P>|z| [95% conf. interval] -------------+---------------------------------------------------------------- e1 | 31.19955 7.380891 4.23 0.000 16.73327 45.66583 e2 | 43.80045 5.779255 7.58 0.000 32.47332 55.12758 ------------------------------------------------------------------------------
An Added Complication
The client was working on a virtual Windows machine which did not have
permission to overwrite the logrank.ado file. It is easy enough to make
a copy of logrank.ado, rename logrank
to logrank2
and treat it as a user-written ado file, but
we’d have to also create a custom version of sts test
which
might be slightly messy due to it being a subcommand.
Instead, we can use the sts.ado to figure out to get the comparable
logrank
command for a given sts test
command.
... `vv' `cmd' _t _d `w' if `touse', strata(`strata') /* */ t0(_t0) `id' `by' `options' `detail' `trend' `p' `q' ...
As mentioned above, vv
is just version control, so here we have
some variables passed to logrank
(_t
,
_d
, and whatever is inside w
), and a bunch of
options. We can modify our version of logrank
to print out
all these, to determine what is actually being passed.
program define logrank /* timevar [deadvar] [, by(group) t0(t0) id(tvid)] */, rclass version 6.0, missing syntax varlist(min=1 max=2) [if] [in] [fw iw] [, /* */ BY(varlist) CHECK Detail ID(varname) LOGRANK /* */ MAT(string) T0(varname) noTItle /* */ STrata(varlist) TVid(varname) trend DINOTE] display "varlist: `varlist'" display "t0: `t0'" display "id: `id'" etc...
Ultimately it ended up that the command was:
. logrank _t _d, by(posttran) id(id) t0(_t0) Equality of survivor functions Log-rank test | Observed Expected posttran | events events ---------+------------------------- 0 | 30 31.20 1 | 45 43.80 ---------+------------------------- Total | 75 75.00 chi2(1) = 0.13 Pr>chi2 = 0.7225