Intra-Class Correlation and Inter-rater Reliability

A few notes on agreement between raters.

Cohen's \(\kappa\)

Cohen's \(\kappa\) measures agreement between two raters on categorical data. The basic calculation is

\[ \kappa = \frac{p_a - p_e}{1 - p_e}, \]

where \(p_a\) is the observed proportion of agreement and \(p_e\) is the proportion of agreement expected by chance. \(\kappa\) is therefore the fraction of the maximum possible agreement beyond chance, \(1 - p_e\), that was actually observed.
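
As a minimal sketch of this calculation in base R, the function below computes \(p_a\) and \(p_e\) from the cross-tabulation of two raters. The ratings and the name cohen_kappa are made up for illustration.

```r
# Hypothetical ratings from two raters on the same 10 items
rater1 <- c("yes", "yes", "no", "no", "yes", "no", "yes", "yes", "no",  "no")
rater2 <- c("yes", "no",  "no", "no", "yes", "no", "yes", "yes", "yes", "no")

cohen_kappa <- function(r1, r2) {
  lv <- sort(unique(c(r1, r2)))
  # Cross-tabulate as proportions; shared levels keep the table square
  tab <- prop.table(table(factor(r1, levels = lv), factor(r2, levels = lv)))
  p_a <- sum(diag(tab))                    # observed agreement
  p_e <- sum(rowSums(tab) * colSums(tab))  # chance agreement from the marginals
  (p_a - p_e) / (1 - p_e)
}

cohen_kappa(rater1, rater2)
```

Here the raters agree on 8 of 10 items (\(p_a = 0.8\)) and both say "yes" half the time (\(p_e = 0.5\)), so \(\kappa = 0.6\).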

Fleiss' \(\kappa\) is an extension to more than two raters and has a similar form.
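
The sketch below shows one way Fleiss' \(\kappa\) could be computed in base R, assuming every subject is rated by the same number of raters. The function name fleiss_kappa and the small ratings matrix are hypothetical.

```r
fleiss_kappa <- function(ratings) {
  # ratings: matrix with one row per subject and one column per rater
  n <- ncol(ratings)
  lv <- sort(unique(as.vector(ratings)))
  # counts[i, j] = number of raters who put subject i in category j
  counts <- t(apply(ratings, 1, function(x) table(factor(x, levels = lv))))
  p_j <- colSums(counts) / sum(counts)             # overall category proportions
  P_i <- (rowSums(counts^2) - n) / (n * (n - 1))   # agreement within each subject
  p_a <- mean(P_i)                                 # mean observed agreement
  p_e <- sum(p_j^2)                                # agreement expected by chance
  (p_a - p_e) / (1 - p_e)
}

# Hypothetical data: 4 subjects, 3 raters, 2 categories
ratings <- rbind(c("a", "a", "a"),
                 c("a", "a", "b"),
                 c("b", "b", "b"),
                 c("a", "b", "b"))
fleiss_kappa(ratings)
```

The last line has the same \((p_a - p_e)/(1 - p_e)\) form as Cohen's \(\kappa\); only the definitions of the two agreement terms change.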

A major flaw of both \(\kappa\) statistics is that, for ordinal data, every disagreement is treated equally. For example, on a Likert scale, ratings of 4 and 5 count as just as much of a disagreement as ratings of 1 and 5. Weighted \(\kappa\) addresses this with a weight matrix that assigns a degree of disagreement to each pair of categories.
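
To illustrate the effect of the weight matrix, here is a rough base R sketch using disagreement weights: \(|i - j|\) for linear weighting and \((i - j)^2\) for quadratic weighting. The function name weighted_kappa, its arguments, and the ratings are assumptions made for this example.

```r
weighted_kappa <- function(r1, r2, levels,
                           type = c("linear", "quadratic", "unweighted")) {
  type <- match.arg(type)
  k <- length(levels)
  obs  <- prop.table(table(factor(r1, levels = levels),
                           factor(r2, levels = levels)))
  expd <- outer(rowSums(obs), colSums(obs))     # expected proportions under chance
  d <- abs(outer(seq_len(k), seq_len(k), "-"))  # distance between categories
  w <- switch(type,
              linear     = d,
              quadratic  = d^2,
              unweighted = (d > 0) * 1)         # every disagreement counts the same
  1 - sum(w * obs) / sum(w * expd)
}

# Hypothetical Likert ratings: one disagreement each, either 5 vs 4 or 5 vs 1
r1   <- c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5)
near <- c(1, 2, 3, 4, 5, 1, 2, 3, 4, 4)
far  <- c(1, 2, 3, 4, 5, 1, 2, 3, 4, 1)

c(unweighted_near = weighted_kappa(r1, near, 1:5, "unweighted"),
  unweighted_far  = weighted_kappa(r1, far,  1:5, "unweighted"),
  linear_near     = weighted_kappa(r1, near, 1:5, "linear"),
  linear_far      = weighted_kappa(r1, far,  1:5, "linear"))
```

In this example the unweighted statistic is identical for the two scenarios, while the linearly weighted version penalizes the 5 vs 1 disagreement more heavily than the 5 vs 4 one.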

Intra-class correlation

ICC is used for continuous measurements. It can also be used in place of weighted \(\kappa\) for ordinal variables. The basic calculation is

\[ ICC = \frac{\sigma^2_w}{\sigma^2_w + \sigma^2_b}, \]

where \(\sigma_w^2\) is the variance of the random effect for the grouping factor (Subject in the example below) and \(\sigma_b^2\) is the residual variance. Since the denominator is the total variance of all ratings, this fraction is the proportion of total variation attributable to the grouping factor, which equals the correlation between two measurements from the same group.

The modern way to estimate the ICC is by a mixed model, extracting the \(\sigma\)'s that are needed.
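
For contrast with the mixed-model approach, the sketch below uses the classical one-way ANOVA (method of moments) estimator, which is not part of these notes and is shown only as an assumption-laden comparison. It assumes balanced data with k measurements per group; the data are simulated and all variable names are made up.

```r
set.seed(1)
n_groups <- 30
k <- 4                                          # measurements per group (balanced)
group <- factor(rep(seq_len(n_groups), each = k))
y <- 10 + rep(rnorm(n_groups, sd = 2), each = k) + rnorm(n_groups * k, sd = 2)

ms  <- anova(lm(y ~ group))[["Mean Sq"]]        # mean squares: group, then residual
msb <- ms[1]
msw <- ms[2]

s2_group <- (msb - msw) / k                     # group variance (sigma_w^2 above)
s2_resid <- msw                                 # residual variance (sigma_b^2 above)
icc <- s2_group / (s2_group + s2_resid)
icc
```

Unlike a mixed model, this estimator can give a negative group variance when the group mean square is smaller than the residual mean square.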

ICC in R

We use the Orthodont data from nlme as our example, looking at the distance measurements and their correlation within Subject.

library("nlme")
library("lme4")
data(Orthodont)  # the Orthodont data ship with nlme

With nlme

Using the nlme package, we fit the model:

fm1 <- lme(distance ~ 1, random = ~ 1 | Subject, data = Orthodont)
summary(fm1)
 Linear mixed-effects model fit by REML
  Data: Orthodont
       AIC      BIC    logLik
  521.3618 529.3803 -257.6809

Random effects:
 Formula: ~1 | Subject
        (Intercept) Residual
StdDev:    1.937002 2.220312

Fixed effects:  distance ~ 1
               Value Std.Error DF  t-value p-value
(Intercept) 24.02315 0.4296606 81 55.91192       0

Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max
-3.2400448 -0.5277439 -0.1072888  0.4731815  2.7687301

Number of Observations: 108
Number of Groups: 27

The Random effects table reports standard deviations: \(\sigma_w\) is the (Intercept) StdDev and \(\sigma_b\) is the Residual StdDev. To obtain the ICC, we extract each variance \(\sigma^2\):

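# Variance of the Subject random intercept
s2w <- getVarCov(fm1)[[1]]
# Residual variance (fm1$sigma is the residual standard deviation)
s2b <- fm1$sigma^2
c(sigma2_w = s2w, sigma2_b = s2b, icc = s2w/(s2w + s2b))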
 sigma2_w  sigma2_b       icc
3.7519762 4.9297832 0.4321677
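
As a quick arithmetic check, the same ICC can be recovered by hand from the two standard deviations printed in the Random effects table. The variable names are illustrative only.

```r
# Standard deviations copied from the summary(fm1) output above
sd_intercept <- 1.937002
sd_residual  <- 2.220312

# Square to get variances, then take the ratio
icc_check <- sd_intercept^2 / (sd_intercept^2 + sd_residual^2)
round(icc_check, 4)
```

This agrees with the extracted value up to the rounding of the printed standard deviations.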

With lme4

Using the lme4 package, we fit the model:

fm2 <- lmer(distance ~ (1 | Subject), data = Orthodont)
summary(fm2)
 Linear mixed model fit by REML ['lmerMod']
Formula: distance ~ (1 | Subject)
   Data: Orthodont

REML criterion at convergence: 515.4

Scaled residuals:
    Min      1Q  Median      3Q     Max
-3.2400 -0.5277 -0.1073  0.4732  2.7687

Random effects:
 Groups   Name        Variance Std.Dev.
 Subject  (Intercept) 3.752    1.937
 Residual             4.930    2.220
Number of obs: 108, groups:  Subject, 27

Fixed effects:
            Estimate Std. Error t value
(Intercept)  24.0231     0.4297   55.91

The Variance column of the Random effects table gives the variance of the Subject random intercept (\(\sigma_w^2\) in the notation above) and the Residual variance (\(\sigma_b^2\)).

s2w <- summary(fm2)$varcor$Subject[1]
s2b <- summary(fm2)$sigma^2
c(sigma2_w = s2w, sigma2_b = s2b, icc = s2w/(s2w + s2b))
 sigma2_w  sigma2_b       icc
3.7519736 4.9297839 0.4321675
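
Another possible way to pull out the variance components is as.data.frame() on VarCorr(), which returns one row per component with the variance in the vcov column. The helper name icc_from_lmer is hypothetical, and the sketch assumes a model with a single random intercept.

```r
library("lme4")
data(Orthodont, package = "nlme")

# Hypothetical helper: ICC for a model with one random intercept
icc_from_lmer <- function(fit) {
  vc <- as.data.frame(VarCorr(fit))        # columns: grp, var1, var2, vcov, sdcor
  s2_group <- vc$vcov[vc$grp != "Residual"]
  s2_resid <- vc$vcov[vc$grp == "Residual"]
  stopifnot(length(s2_group) == 1)         # only valid for a single random intercept
  s2_group / (s2_group + s2_resid)
}

fm <- lmer(distance ~ (1 | Subject), data = Orthodont)
icc_from_lmer(fm)
```

Working from the data frame avoids indexing into the summary object by position.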

This work is licensed under CC BY-NC 4.0.