Probability of Adverse Events That Have Not Yet Occurred: A Statistical Reminder
Eypasch et al., 1995
A very quick discussion of the rule of three, which establishes how to estimate a confidence interval in scenarios where the observed prevalence of a rare event is 0.
Jovanovic & Levy, 1997
A more in-depth explanation, including derivations of the rule and a discussion of a correction for small sample sizes (n < 100).
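As a quick illustration (my own sketch, not code from either paper): with zero events in n trials, the exact one-sided 95% upper bound solves (1 - p)^n = 0.05, and the rule of three approximates it as 3/n because -ln(0.05) is roughly 3.

```python
# Sketch comparing the exact upper bound to the rule-of-three approximation.
# With 0 events in n trials, (1 - p)^n = 0.05 gives p = 1 - 0.05**(1/n),
# which is roughly 3/n since -ln(0.05) is about 3.
for n in (20, 50, 100, 500):
    exact = 1 - 0.05 ** (1 / n)
    approx = 3 / n
    print(f"n={n:4d}  exact={exact:.4f}  rule of three={approx:.4f}")
```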
Shows that leave-one-out cross validation (LOOCV) is asymptotically equivalent to AIC when used with mixed effects regression models.
When collecting longitudinal data, is it worth collecting intermediate time points, or is it better to collect a larger sample with only two time points? The conclusion is that the intermediate time points are not worth collecting unless the sample size is very large and non-linearity is strongly assumed.
The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis
Hoenig & Heisey, 2001
A discussion of the inappropriate use of post-hoc power analysis. They argue that post-hoc power analyses should never be done, and suggest reporting confidence intervals instead as a way to describe the uncertainty and strength of evidence.
Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors
Gelman & Carlin, 2014
Oh No! I Got the Wrong Sign! What Should I Do?
Kennedy, 2010
Lists around 40 different reasons why a "wrong sign" can be obtained, and offers very brief advice on addressing each. It is published in the economics literature and framed in those terms, but many of the entries on the list apply more generally.
Categories of examples include:
An excellent overview of the types of missing data (missing at random, missing not at random, missing completely at random) and of whether multiple imputation (MI) is needed or complete-case analysis is sufficient.
The Proportion of Missing Data Should Not Be Used to Guide Decisions on Multiple Imputation
Madley-Dowd et al., 2019
Shows that even in situations with a large proportion of missing data (up to 90%), there is still a benefit to carrying out MI. The paper also discusses the common advice to simply do a complete-case analysis if doing so would lose fewer than 5% of observations to missing data.
Power Analysis and Determination of Sample Size for Covariance Structure Modeling
MacCallum et al., 1996
Provides a way to estimate the sample size needed for various Structural Equation Models (SEM)/path analyses. The results are based upon the degrees of freedom of the model and the theorized effect size. Table 4 is particularly useful.
(Wanted: A good citation explaining how to estimate DF in an SEM, with visual examples!)
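A rough sketch of how I understand the RMSEA-based power calculation in this approach (the null/alternative RMSEA values and sample size below are placeholders, not values from the paper):

```python
# Hedged sketch of RMSEA-based power for a test of close fit, assuming the
# noncentrality parameter is (N - 1) * df * rmsea**2. The rmsea0/rmsea1
# values and N are illustrative placeholders.
from scipy.stats import ncx2

def rmsea_power(n, df, rmsea0=0.05, rmsea1=0.08, alpha=0.05):
    ncp0 = (n - 1) * df * rmsea0 ** 2       # noncentrality under "close fit"
    ncp1 = (n - 1) * df * rmsea1 ** 2       # noncentrality under the alternative
    crit = ncx2.ppf(1 - alpha, df, ncp0)    # critical value of the test statistic
    return 1 - ncx2.cdf(crit, df, ncp1)     # chance of exceeding it under H1

print(rmsea_power(n=200, df=50))
```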
A discussion of when to use fixed effects regression models, versus when to use mixed effects regression models (also called "random effects regression models" in Econometrics).
Their primary argument for fixed effects is that omitted variable bias and endogeneity at level 2 (group) do not impact level 1 (individual) estimates. A secondary argument is that mixed effects models require at least 30 groups. (Here's an online discussion arguing for lower thresholds, with several different citations.)
They also suggest another alternative, the "within-between mixed model": group-mean center the predictors and also include the group means as predictors. This model splits the within and between effects completely. For example, if you have some predictor X, include in the model both the group mean X̄_i and the group-mean-centered value X_ij - X̄_i (see the sketch below).
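A minimal sketch of that within-between setup (all data and column names here are made up; statsmodels is just one way to fit the model):

```python
# Sketch of the "within-between" model: split predictor x into its group mean
# (between effect) and the group-mean-centered deviation (within effect).
# Data and column names are simulated placeholders.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_groups, n_per = 30, 10
group = np.repeat(np.arange(n_groups), n_per)
x = rng.normal(loc=group * 0.1, scale=1.0, size=group.size)   # x varies across groups
u = rng.normal(scale=0.5, size=n_groups)[group]               # group random intercepts
y = 1.0 + 0.5 * x + u + rng.normal(size=group.size)
df = pd.DataFrame({"group": group, "x": x, "y": y})

df["x_between"] = df.groupby("group")["x"].transform("mean")  # group mean of x
df["x_within"] = df["x"] - df["x_between"]                    # group-mean centered x

# Random intercept for group; within and between effects enter as fixed effects.
fit = smf.mixedlm("y ~ x_within + x_between", data=df, groups=df["group"]).fit()
print(fit.summary())
```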
The Difference Between "Significant" and "Not Significant" is not Itself Statistically Significant
Gelman & Stern, 2006
This paper argues that p-values are noisy, in the sense that small changes in the data can introduce large changes in p-values - even moving between significance and non-significance.
...even large changes in significance levels can correspond to small, nonsignificant changes in the underlying quantities.
A nice summary of the issues around p-values.
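A small numeric illustration of the point (my own arithmetic, in the spirit of the paper's argument): one estimate can be "significant", another not, and yet the difference between them can be nowhere near significant.

```python
# Illustrative arithmetic (my example): two estimates with equal standard errors,
# one significant and one not, whose difference is itself not significant.
from math import sqrt

est1, se1 = 25.0, 10.0            # z = 2.5, significant at the 0.05 level
est2, se2 = 10.0, 10.0            # z = 1.0, not significant
diff = est1 - est2                # 15
se_diff = sqrt(se1**2 + se2**2)   # about 14.1
print(diff / se_diff)             # about 1.06 -- far from significant
```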
Redefine statistical significance.
Benjamin et al., 2018
Contrary to the prior papers, this one suggests continuing to use p-values, but with α = .005 instead of α = .05 for significance in exploratory analysis. They also suggest labeling results with .005 < p < .05 as "suggestive".
They use a Bayesian approach, arguing that α = .005 corresponds to a Bayes factor of 14 to 26, which translates (according to Kass & Raftery (1995)) to "substantial" to "strong" evidence.
Rhemtulla et al., 2012
A simulation showing when categorical variables can be treated as continuous without bias, specifically for confirmatory factor analysis (CFA). Their conclusion:
Our findings confirm that when observed variables have fewer than five categories, robust categorical methodology is best. With five to seven categories, both methods yield acceptable performance.
On Judging the Significance of Differences by Examining the Overlap Between Confidence Intervals
Schenker & Gentleman, 2001
Briefly:
Two confidence intervals can overlap while the difference between them is still statistically significant under a hypothesis test; this is more likely to happen when the two standard errors are similar in size.
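A quick sketch of why (numbers are my own): with equal standard errors, the 95% intervals overlap whenever the two estimates differ by less than 2 x 1.96 x SE (about 3.92 SE), but the z-test on the difference is significant once they differ by more than 1.96 x sqrt(2) x SE (about 2.77 SE), so anything in between is significant despite overlapping intervals.

```python
# Sketch (my own numbers): overlapping 95% CIs alongside a significant z-test.
from math import sqrt
from scipy.stats import norm

est1, est2, se = 0.0, 3.0, 1.0                # equal standard errors for simplicity
ci1 = (est1 - 1.96 * se, est1 + 1.96 * se)    # (-1.96, 1.96)
ci2 = (est2 - 1.96 * se, est2 + 1.96 * se)    # ( 1.04, 4.96) -- overlaps ci1
z = (est2 - est1) / sqrt(se**2 + se**2)       # about 2.12
p = 2 * norm.sf(z)                            # about 0.034, significant at 0.05
print(ci1, ci2, z, p)
```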
Linear regression and the normality assumption
Schmidt & Finan, 2018
A discussion of the importance of the normality assumption in OLS linear regression. From the abstract:
... violations of the normality assumption in linear regression analyses do not [bias point estimates]. The normality assumption is necessary to unbiasedly estimate standard errors, and hence confidence intervals and P-values. However, in large sample sizes (e.g., where the number of observations per variable is >10) violations of this normality assumption often do not noticeably impact results.
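A quick simulation sketch of that claim (mine, not from the paper): OLS slope estimates stay centered on the truth even with heavily skewed errors.

```python
# Simulation sketch (not from the paper): OLS with skewed (exponential) errors.
# The slope estimates remain centered on the true value of 2.0 even though the
# residuals are far from normal.
import numpy as np

rng = np.random.default_rng(1)
true_slope, n, reps = 2.0, 200, 2000
slopes = []
for _ in range(reps):
    x = rng.normal(size=n)
    errors = rng.exponential(scale=1.0, size=n) - 1.0   # skewed, mean zero
    y = 1.0 + true_slope * x + errors
    slopes.append(np.polyfit(x, y, 1)[0])               # OLS slope estimate
print(np.mean(slopes))   # close to 2.0 despite the non-normal errors
```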
The Importance of the Normality Assumption in Large Public Health Data Sets
Lumley et al., 2002
Another, older discussion about the normality assumption in linear regression; they conclude it only matters for very small sample sizes (e.g., n < 100).
Robustness of Linear Mixed-Effects Models to Violations of Distributional Assumptions
Schielzeth et al., 2020
Model estimates were usually robust to violations of assumptions, with the exception of slight upward biases in estimates of random effect variance if the generating distribution was bimodal but was modelled by normal error distributions. Further, estimates for random effect components that violated distributional assumptions became less precise but remained unbiased. However, this particular problem did not affect other parameters of the model.
In other words, normality is not that important!
Determining Power and Sample Size for Simple and Complex Mediation Models
Schoemann et al., 2017
A method for performing Monte Carlo simulation-based power analysis for mediation models. They also provide this tool: https://schoemanna.shinyapps.io/mc_power_med/. It's a bit slow and, as with any power calculation for a complex model, requires a lot of assumptions or knowledge of the data.
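A stripped-down sketch of the general idea as I understand it (the path coefficients, sample size, and replication counts below are placeholders, not values from the paper or the app): simulate data under assumed a and b paths, build a Monte Carlo confidence interval for the indirect effect a*b in each replicate, and count how often it excludes zero.

```python
# Rough sketch of Monte-Carlo-simulation power for a simple mediation model
# (X -> M -> Y). All parameter values are illustrative placeholders.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
a, b, c_prime, n = 0.3, 0.3, 0.1, 200   # assumed path coefficients and sample size
reps, mc_draws = 500, 1000
significant = 0
for _ in range(reps):
    x = rng.normal(size=n)
    m = a * x + rng.normal(size=n)
    y = b * m + c_prime * x + rng.normal(size=n)
    fit_m = sm.OLS(m, sm.add_constant(x)).fit()                        # a path
    fit_y = sm.OLS(y, sm.add_constant(np.column_stack([x, m]))).fit()  # b path
    a_hat, a_se = fit_m.params[1], fit_m.bse[1]
    b_hat, b_se = fit_y.params[2], fit_y.bse[2]
    # Monte Carlo confidence interval for the indirect effect a*b.
    ab = rng.normal(a_hat, a_se, mc_draws) * rng.normal(b_hat, b_se, mc_draws)
    lo, hi = np.percentile(ab, [2.5, 97.5])
    significant += (lo > 0) or (hi < 0)
print(significant / reps)   # estimated power for detecting the indirect effect
```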
Why Test for Proportional Hazards?
Stensrud & Hernan, 2020
Three key claims:
... a hazard ratio from a Cox model needs to be interpreted as a weighted average of the true hazard ratios over the entire follow-up period.
A rebuttal of the typical "5 imputations is enough" recommendation.
The m = 5 (or, as this paper puts it, 2-10) recommendation is sufficient for efficient point estimates, but not for stable estimates of their standard errors. This paper suggests 200 imputations for sample sizes where this is reasonable.
They also introduce a two-stage procedure to produce an estimate of the total number of imputations needed. It is implemented in Stata and SAS by von Hippel, and in R via the howManyImputations package.
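If I'm reading the two-stage procedure right, its core is a quadratic rule along the lines of m ~ 1 + 0.5 * (FMI / cv)^2, where FMI is the fraction of missing information estimated from a small pilot run and cv is the coefficient of variation you are willing to tolerate in the standard errors across repeated imputation runs. A hedged sketch (the pilot FMI value is made up):

```python
# Hedged sketch of the quadratic rule as I understand it; the FMI value and the
# target coefficient of variation (cv) below are made-up placeholders.
import math

def imputations_needed(fmi, cv=0.05):
    """Approximate number of imputations: m ~= 1 + 0.5 * (FMI / cv)**2."""
    return math.ceil(1 + 0.5 * (fmi / cv) ** 2)

# Example: a pilot run of ~20 imputations estimates FMI = 0.4 (made-up value).
print(imputations_needed(0.4))   # 33 imputations under these assumptions
```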
Practical advice on the usage and implementation of the synthetic control method.
Log-likelihood-based Pseudo-R2 in Logistic Regression: Deriving Sample-sensitive Benchmarks
Hemmert et al., 2016
This paper discusses the various pseudo-R2 measures available for logistic regression and their sensitivity to characteristics such as sample size.
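For a concrete example of one common log-likelihood-based measure, here is a quick sketch of McFadden's pseudo-R2 (the data are simulated placeholders; statsmodels reports the same quantity as prsquared):

```python
# Sketch: McFadden's log-likelihood-based pseudo-R2 for a logistic regression,
# 1 - llf/llnull, where llf is the fitted model's log-likelihood and llnull is
# the intercept-only model's. Data are simulated placeholders.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-(0.5 + 1.0 * x)))
y = rng.binomial(1, p)

fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
pseudo_r2 = 1 - fit.llf / fit.llnull
print(pseudo_r2, fit.prsquared)   # the two values match
```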
It is demonstrated that predicted values after fitting a model on multiply imputed data can be obtained from the final model, and results are equivalent (in the case of linear transformations) or almost equivalent (for non-linear transformations) to the results from using Rubin's rules on the individual models.
Inappropriate Use of Bivariable Analysis to Screen Risk Factors for Use in Multivariable Analysis
Sun et al., 1996
An article discussing how using t-tests to screen predictors in a model is inappropriate.
We believe that the use of the BVS [BiVariable Selection] method in multivariable analysis is one of the most common errors in data analysis in the current literature.
Note that this article is far from perfect; it advocates for stepwise selection techniques, which are also very inappropriate in most settings. It is primarily interesting because of its age; the inappropriateness of bivariable screening is not a new concern.
This work is licensed under CC BY-NC 4.0