Intra-Class Correlation and Inter-rater Reliability

A few notes on agreement between raters.

Cohen's \(\kappa\)

Cohen's \(\kappa\) can be used for agreement between two raters on categorical data. The basic calculation is

\[ \kappa = \frac{p_a - p_e}{1 - p_e}, \]

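where \(p_a\) is the observed proportion of agreement and \(p_e\) is the proportion of agreement expected by chance. \(\kappa\) is therefore the observed agreement beyond chance, expressed as a fraction of the maximum possible agreement beyond chance: \(\kappa = 1\) indicates perfect agreement and \(\kappa = 0\) indicates agreement no better than chance.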
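As a rough illustration of the formula, the following base R sketch computes \(\kappa\) by hand for two raters. The ratings are made up for this example, and the object names (rater1, rater2, kappa_hat) are hypothetical.

```r
# Hypothetical ratings from two raters on the same 10 items (made-up data).
rater1 <- c("yes", "yes", "no", "yes", "no", "no",  "yes", "no", "yes", "yes")
rater2 <- c("yes", "no",  "no", "yes", "no", "yes", "yes", "no", "yes", "yes")

# Use a common set of levels so the cross-table is square.
lev <- union(rater1, rater2)
tab <- table(factor(rater1, levels = lev), factor(rater2, levels = lev))

p   <- tab / sum(tab)                # joint proportions
p_a <- sum(diag(p))                  # observed agreement
p_e <- sum(rowSums(p) * colSums(p))  # agreement expected by chance
kappa_hat <- (p_a - p_e) / (1 - p_e)

c(p_a = p_a, p_e = p_e, kappa = kappa_hat)
```

Here the raters agree on 8 of 10 items (\(p_a = 0.8\)), chance agreement is \(0.6 \times 0.6 + 0.4 \times 0.4 = 0.52\), and so \(\kappa = 0.28 / 0.48 \approx 0.58\).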

Fleiss' \(\kappa\) is an extension to more than two raters and has a similar form.
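A minimal sketch of Fleiss' \(\kappa\) in base R, assuming every item is rated by the same number of raters. The function name fleiss_kappa and the count matrix are hypothetical, made up for this example.

```r
# counts: one row per item, one column per category; each cell is the number
# of raters who assigned that item to that category.
fleiss_kappa <- function(counts) {
  n <- rowSums(counts)
  stopifnot(all(n == n[1]), n[1] > 1)  # same number of raters for every item
  n <- n[1]
  P_i   <- (rowSums(counts^2) - n) / (n * (n - 1))  # per-item agreement
  p_j   <- colSums(counts) / sum(counts)            # category proportions
  P_bar <- mean(P_i)                                # observed agreement
  P_e   <- sum(p_j^2)                               # chance agreement
  unname((P_bar - P_e) / (1 - P_e))
}

# Hypothetical data: 5 items, 3 categories, 4 raters per item.
counts <- rbind(c(4, 0, 0),
                c(2, 2, 0),
                c(0, 3, 1),
                c(1, 1, 2),
                c(0, 0, 4))
fleiss_kappa(counts)
```

The final line has the same shape as Cohen's formula, with the mean per-item agreement in place of \(p_a\) and the sum of squared category proportions in place of \(p_e\).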

A major flaw of either \(\kappa\) is that, for ordinal data, all disagreements are treated equally. For example, on a 5-point Likert scale, ratings of 4 and 5 count as just as much of a disagreement as ratings of 1 and 5. Weighted \(\kappa\) addresses this by including a weight matrix that assigns a degree of disagreement to each pair of categories.
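To make the weight matrix concrete, here is a base R sketch of weighted \(\kappa\) for two raters, written in terms of disagreement weights. Linear weights \(|i - j| / (k - 1)\) are used as one common choice; that choice, the made-up ratings, and the object names are assumptions of this example.

```r
# Hypothetical 5-point Likert ratings from two raters (made-up data).
r1 <- c(1, 2, 3, 4, 5, 4, 3, 2, 5, 1)
r2 <- c(1, 3, 3, 5, 5, 4, 2, 2, 4, 5)
k  <- 5

tab <- table(factor(r1, levels = 1:k), factor(r2, levels = 1:k))
p   <- tab / sum(tab)                   # observed joint proportions
e   <- outer(rowSums(p), colSums(p))    # proportions expected by chance

# Linear disagreement weights: 0 on the diagonal, 1 for the most distant pair.
w <- abs(outer(1:k, 1:k, "-")) / (k - 1)
kappa_w <- 1 - sum(w * p) / sum(w * e)

# Unweighted kappa is the special case where every disagreement has weight 1.
w01 <- 1 - diag(k)
kappa_u <- 1 - sum(w01 * p) / sum(w01 * e)

c(weighted = kappa_w, unweighted = kappa_u)
```

With these weights, a 4 versus 5 disagreement costs w[4, 5] = 0.25, while a 1 versus 5 disagreement costs w[1, 5] = 1.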


Intra-class correlation

ICC is used for continuous measurements. It can also be used in place of weighted \(\kappa\) for ordinal variables. The basic calculation is

\[ ICC = \frac{\sigma^2_w}{\sigma^2_w + \sigma^2_b}, \]

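where \(\sigma_w^2\) is the variance of the random intercept for the units being rated (how much measurements vary from one unit to the next) and \(\sigma_b^2\) is the residual variance (how much ratings of the same unit disagree with each other). In the grouped-data terms used in the example below, these are the between-subject and within-subject variances, respectively. Since the denominator is the total variance of all ratings, the fraction is the proportion of total variation attributable to differences between units rather than to disagreement among ratings of the same unit.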

The modern way to estimate the ICC is to fit a mixed model with a random intercept and extract the two variance components.
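To see why this ratio is a correlation, it may help to write out the random-intercept model. The notation \(y_{ij}\), \(\mu\), \(u_i\), and \(\varepsilon_{ij}\) is introduced here for illustration; \(\sigma_w^2\) and \(\sigma_b^2\) are used as in the formula above, with the random intercept and the residual assumed independent.

```latex
\[ y_{ij} = \mu + u_i + \varepsilon_{ij}, \qquad
   u_i \sim N(0, \sigma^2_w), \qquad
   \varepsilon_{ij} \sim N(0, \sigma^2_b) \]

\[ \operatorname{Var}(y_{ij}) = \sigma^2_w + \sigma^2_b, \qquad
   \operatorname{Cov}(y_{ij}, y_{ik}) = \operatorname{Var}(u_i) = \sigma^2_w
   \quad (j \neq k) \]

\[ \operatorname{Corr}(y_{ij}, y_{ik})
   = \frac{\sigma^2_w}{\sigma^2_w + \sigma^2_b} = ICC \]
```

Under this model, the ICC is the correlation between any two measurements taken on the same unit \(i\).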

ICC in R

As an example, use the Orthodont data from the nlme package: take the distance measurements and compute the intra-class correlation within Subject.


With nlme

Using the nlme package, we fit the model:

library(nlme)
fm1 <- lme(distance ~ 1, random = ~ 1 | Subject, data = Orthodont)
summary(fm1)
 Linear mixed-effects model fit by REML
  Data: Orthodont
       AIC      BIC    logLik
  521.3618 529.3803 -257.6809

Random effects:
 Formula: ~1 | Subject
        (Intercept) Residual
StdDev:    1.937002 2.220312

Fixed effects:  distance ~ 1
               Value Std.Error DF  t-value p-value
(Intercept) 24.02315 0.4296606 81 55.91192       0

Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max
-3.2400448 -0.5277439 -0.1072888  0.4731815  2.7687301

Number of Observations: 108
Number of Groups: 27

The Random effects table reports standard deviations rather than variances: the (Intercept) StdDev is \(\sigma_w\) and the Residual StdDev is \(\sigma_b\). To obtain the ICC, we extract each variance:

s2w <- getVarCov(fm1)[[1]]  # random-intercept (Subject) variance
s2b <- fm1$sigma^2          # residual variance
c(sigma2_w = s2w, sigma2_b = s2b, icc = s2w/(s2w + s2b))
 sigma2_w  sigma2_b       icc
3.7519762 4.9297832 0.4321677

With lme4

Using the lme4 package, we fit the model:

library(lme4)
data(Orthodont, package = "nlme")  # the data set ships with nlme
fm2 <- lmer(distance ~ (1 | Subject), data = Orthodont)
summary(fm2)
 Linear mixed model fit by REML ['lmerMod']
Formula: distance ~ (1 | Subject)
   Data: Orthodont

REML criterion at convergence: 515.4

Scaled residuals:
    Min      1Q  Median      3Q     Max
-3.2400 -0.5277 -0.1073  0.4732  2.7687

Random effects:
 Groups   Name        Variance Std.Dev.
 Subject  (Intercept) 3.752    1.937
 Residual             4.930    2.220
Number of obs: 108, groups:  Subject, 27

Fixed effects:
            Estimate Std. Error t value
(Intercept)  24.0231     0.4297   55.91

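The Variance column of the Random effects table gives both components directly: the Subject (Intercept) variance is \(\sigma_w^2\) and the Residual variance is \(\sigma_b^2\). In terms of subjects, the Subject row is the between-subject variance and the Residual row is the within-subject variance.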

s2w <- VarCorr(fm2)$Subject[1, 1]  # random-intercept (Subject) variance
s2b <- sigma(fm2)^2                # residual variance
c(sigma2_w = s2w, sigma2_b = s2b, icc = s2w/(s2w + s2b))
 sigma2_w  sigma2_b       icc
3.7519736 4.9297839 0.4321675
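As a cross-check that is not part of the original notes, the same quantities can be estimated without a mixed-model package, using the method-of-moments estimator from a one-way ANOVA. This sketch assumes balanced data (the same number of measurements per subject) and a non-negative variance estimate, in which case it should agree closely with the REML fits above. The names d, msb, msw, and icc are hypothetical.

```r
data(Orthodont, package = "nlme")

# Plain data frame with Subject as an unordered grouping factor.
d <- data.frame(distance = Orthodont$distance,
                Subject  = factor(Orthodont$Subject, ordered = FALSE))

# Number of measurements per subject; the estimator below assumes balance.
k <- as.numeric(unique(table(d$Subject)))
stopifnot(length(k) == 1)

ms  <- summary(aov(distance ~ Subject, data = d))[[1]][["Mean Sq"]]
msb <- ms[1]  # mean square between subjects
msw <- ms[2]  # mean square within subjects (residual)

s2w <- (msb - msw) / k  # Subject variance, sigma2_w in the notation above
s2b <- msw              # residual variance, sigma2_b in the notation above
icc <- s2w / (s2w + s2b)

c(sigma2_w = s2w, sigma2_b = s2b, icc = icc)
```

With unbalanced data or a negative moment estimate, the two approaches can differ, and the mixed-model fit is generally the safer choice.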