A few notes on agreement between raters.

Cohen’s \(\kappa\)

Cohen’s \(\kappa\) can be used for agreement between two raters on categorical data. The basic calculation is

\[ \kappa = \frac{p_a - p_e}{1 - p_e}, \]

where \(p_a\) is the observed proportion of agreement and \(p_e\) is the proportion of agreement expected by chance. Thus \(\kappa\) is the proportion of agreement beyond chance that is actually observed, out of the maximum agreement beyond chance that is possible.
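As a quick illustration, \(\kappa\) can be computed by hand from a cross-tabulation of the two raters' categories (a minimal base-R sketch; the 3-category table below is made up):

```r
# Hypothetical cross-table: rows = rater 1, columns = rater 2
tab <- matrix(c(20,  5,  0,
                10, 15,  5,
                 0,  5, 40), nrow = 3, byrow = TRUE)
n   <- sum(tab)
p_a <- sum(diag(tab)) / n                       # observed agreement
p_e <- sum(rowSums(tab) * colSums(tab)) / n^2   # agreement expected by chance
(p_a - p_e) / (1 - p_e)                         # kappa, about 0.614
```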

Fleiss’ \(\kappa\) extends the idea to more than two raters and has a similar form (strictly speaking it generalizes Scott’s \(\pi\) rather than Cohen’s \(\kappa\), since chance agreement is computed from the pooled marginals).
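For completeness, a minimal base-R sketch of the Fleiss calculation on a made-up count matrix (rows are subjects, columns are categories, and each row sums to the number of raters):

```r
# Hypothetical counts: 6 subjects, 3 categories, 4 raters per subject
counts <- matrix(c(4, 0, 0,
                   0, 4, 0,
                   2, 2, 0,
                   1, 1, 2,
                   0, 0, 4,
                   3, 1, 0), ncol = 3, byrow = TRUE)
n   <- rowSums(counts)[1]                        # raters per subject
P_i <- (rowSums(counts^2) - n) / (n * (n - 1))   # per-subject agreement
p_j <- colSums(counts) / sum(counts)             # overall category proportions
(mean(P_i) - sum(p_j^2)) / (1 - sum(p_j^2))      # Fleiss' kappa, about 0.49
```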

A major flaw in either \(\kappa\) is that for ordinal data all disagreements are treated equally. E.g. on a 5-point Likert scale, ratings of 4 and 5 count as exactly as discordant as ratings of 1 and 5. Weighted \(\kappa\) addresses this by introducing a weight matrix that assigns a graded level of disagreement to each pair of categories.
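A minimal sketch of the weighted version, using linear disagreement weights \(w_{ij} = |i - j|/(k - 1)\) (one common choice; squared weights are another) on a made-up 5-point cross-table:

```r
k <- 5                                      # number of ordinal categories
w <- abs(outer(1:k, 1:k, "-")) / (k - 1)    # linear disagreement weights
# Hypothetical cross-table of two raters' Likert ratings
tab <- diag(10, k)
tab[1, 2] <- tab[2, 1] <- 3
tab[4, 5] <- tab[5, 4] <- 2
p_obs <- tab / sum(tab)
p_exp <- outer(rowSums(p_obs), colSums(p_obs))
1 - sum(w * p_obs) / sum(w * p_exp)         # weighted kappa, about 0.90
```

Because every disagreement in this table is between adjacent categories, the weighted value comes out higher than the unweighted \(\kappa\) would on the same table.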

Intra-class correlation

The ICC is used for continuous measurements; it can also serve in place of weighted \(\kappa\) for ordinal variables. The basic calculation is

\[ ICC = \frac{\sigma^2_b}{\sigma^2_b + \sigma^2_w}, \]

where \(\sigma^2_b\) and \(\sigma^2_w\) represent between- and within-group variability respectively (between and within subjects, say, or between and within raters). Since the denominator is the total variance of all ratings regardless of group, this fraction represents the proportion of total variation accounted for by between-group variation; equivalently, it is the expected correlation between two ratings drawn from the same group.

The modern way to estimate the ICC is with a mixed model, extracting the variance components that are needed.

ICC in R

We use the Orthodont data from nlme as our example, looking at the distance measurements and their correlation within Subject.

library(nlme)
library(lme4)
data(Orthodont)

With nlme

Using the nlme package, we fit the model:

fm1 <- lme(distance ~ 1, random = ~ 1 | Subject, data = Orthodont)
summary(fm1)
## Linear mixed-effects model fit by REML
##  Data: Orthodont 
##        AIC      BIC    logLik
##   521.3618 529.3803 -257.6809
## 
## Random effects:
##  Formula: ~1 | Subject
##         (Intercept) Residual
## StdDev:    1.937002 2.220312
## 
## Fixed effects: distance ~ 1 
##                Value Std.Error DF  t-value p-value
## (Intercept) 24.02315 0.4296606 81 55.91192       0
## 
## Standardized Within-Group Residuals:
##        Min         Q1        Med         Q3        Max 
## -3.2400448 -0.5277439 -0.1072888  0.4731815  2.7687301 
## 
## Number of Observations: 108
## Number of Groups: 27

The between-subject standard deviation is reported as the (Intercept) StdDev, and the within-subject standard deviation as the Residual StdDev. To obtain the ICC, we extract each \(\sigma^2\):

s2b <- getVarCov(fm1)[[1]]  # between-subject (random-intercept) variance
s2w <- fm1$sigma^2          # within-subject (residual) variance
c(sigma2_b = s2b, sigma2_w = s2w, icc = s2b/(s2b + s2w))
##  sigma2_b  sigma2_w       icc 
## 3.7519762 4.9297832 0.4321677

With lme4

Using the lme4 package, we fit the model:

fm2 <- lmer(distance ~ (1 | Subject), data = Orthodont)
summary(fm2)
## Linear mixed model fit by REML ['lmerMod']
## Formula: distance ~ (1 | Subject)
##    Data: Orthodont
## 
## REML criterion at convergence: 515.4
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.2400 -0.5277 -0.1073  0.4732  2.7687 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Subject  (Intercept) 3.752    1.937   
##  Residual             4.930    2.220   
## Number of obs: 108, groups:  Subject, 27
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  24.0231     0.4297   55.91

The Variance column of the Random effects table gives the between-subject (Subject) and within-subject (Residual) variances.

s2b <- summary(fm2)$varcor$Subject[1]  # between-subject variance
s2w <- summary(fm2)$sigma^2            # within-subject (residual) variance
c(sigma2_b = s2b, sigma2_w = s2w, icc = s2b/(s2b + s2w))
##  sigma2_b  sigma2_w       icc 
## 3.7519771 4.9297829 0.4321678
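The same components can be pulled out a little more transparently with lme4's VarCorr accessor (a sketch assuming fm2 as fit above):

```r
vc  <- as.data.frame(VarCorr(fm2))    # columns grp, var1, var2, vcov, sdcor
s2b <- vc$vcov[vc$grp == "Subject"]   # between-subject variance
s2w <- vc$vcov[vc$grp == "Residual"]  # within-subject (residual) variance
s2b / (s2b + s2w)                     # ICC, about 0.432 as before
```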