While the formula looks different from what I show on my page, I believe the result
is the same. As you probably know, raw percent agreement can overstate the true level
of agreement because some agreement occurs by chance alone, so a chance-corrected
modification of this measure, such as the one offered by Fleiss, may be useful.
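In case it helps to see that correction spelled out: the chance-corrected statistics all
take the observed agreement, subtract the agreement expected by chance, and rescale so
that 0 means chance-level agreement and 1 means perfect agreement. A minimal sketch in
Python (the numbers here are made up purely for illustration, not taken from my page):

def chance_corrected(a_obs, a_exp):
    # Proportion of the possible improvement over chance that the raters achieved.
    return (a_obs - a_exp) / (1 - a_exp)

# Illustrative only: 80% raw agreement with 50% expected by chance gives 0.60.
print(round(chance_corrected(0.80, 0.50), 3))   # 0.6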
To demonstrate that the calculations I provide agree with Fleiss' overall agreement
measure, you can use this page, which also provides better (chance-corrected) measures
of agreement:
https://mlnl.net/jg/software/ira/
I took my example on page 5 and entered the three raters' scores into a text file
with the following content:
Reviewer1 Reviewer2 Reviewer3
1 1 1
2 2 2
3 3 3
2 3 3
1 1 1
2 3 1
2 2 1
1 1 1
2 1 1
1 1 1
2 2 2
3 3 3
1 1 1
1 1 2
2 2 2
2 2 2
1 1 1
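If you would rather check this in code than through the web page, here is a minimal
Python sketch of Fleiss' overall observed agreement, expected agreement, and kappa,
written by hand rather than taken from that page. Note that only 17 of the 18 cases
appear in the listing above, so running it on just these rows will give somewhat
different numbers than the page reports for the full data set.

from collections import Counter

def fleiss_stats(rows, categories):
    # Fleiss' observed agreement, expected agreement, and kappa.
    # Each row holds one case's ratings, one column per rater.
    n_raters = len(rows[0])
    n_cases = len(rows)
    counts = [Counter(row) for row in rows]          # category counts per case
    # Observed agreement: average proportion of agreeing rater pairs per case.
    a_obs = sum(
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ) / n_cases
    # Expected (chance) agreement from the overall category proportions.
    p = [sum(c[cat] for c in counts) / (n_cases * n_raters) for cat in categories]
    a_exp = sum(pj ** 2 for pj in p)
    return a_obs, a_exp, (a_obs - a_exp) / (1 - a_exp)

# The 17 cases visible in the listing above (one of the 18 cases is missing here).
ratings = [
    (1, 1, 1), (2, 2, 2), (3, 3, 3), (2, 3, 3), (1, 1, 1), (2, 3, 1),
    (2, 2, 1), (1, 1, 1), (2, 1, 1), (1, 1, 1), (2, 2, 2), (3, 3, 3),
    (1, 1, 1), (1, 1, 2), (2, 2, 2), (2, 2, 2), (1, 1, 1),
]

a_obs, a_exp, kappa = fleiss_stats(ratings, categories=(1, 2, 3))
print(f"A_obs = {a_obs:.3f}  A_exp = {a_exp:.3f}  kappa = {kappa:.3f}")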
The page above returned the following results:
Data
  3 raters and 18 cases
  1 variable with 54 decisions in total
  no missing data

1: rra
  Fleiss        |  A_obs = 0.741   A_exp = 0.372   Kappa = 0.587
  Krippendorff  |  D_obs = 0.259   D_exp = 0.639   Alpha = 0.595
  Pairwise avg. |  % agr = 74.1    Kappa = 0.592
You can see that the Fleiss A_obs of 0.741 is within rounding error of the 74.07% I show
on my page (ignoring the difference between a proportion and a percentage), and the
% agreement of 74.1 under the pairwise average is the same value as well.
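As a further quick check, the kappa and alpha in that output follow directly from the
other columns (I am using the rounded values straight from the table above, so the last
digit can shift slightly):

a_obs, a_exp = 0.741, 0.372    # Fleiss row: observed / expected agreement
d_obs, d_exp = 0.259, 0.639    # Krippendorff row: observed / expected disagreement

kappa = (a_obs - a_exp) / (1 - a_exp)   # about 0.588 from rounded inputs; the page reports 0.587
alpha = 1 - d_obs / d_exp               # about 0.595, matching the page
print(f"kappa = {kappa:.3f}  alpha = {alpha:.3f}")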
Krippendorff's alpha seems to be a popular measure to use for multiple
raters.
Personally, if I were reporting the level of agreement among raters, I would report the
mean level of agreement (74.1%), Fleiss' kappa (.587), and Krippendorff's alpha (.595),
and then let readers decide which they wish to believe.
Good luck