Topic: Reliability/Agreement among Coders: Rater Agreement
1. Presentations and Videos
Document: Assessing Coder/Rater Agreement for Nominal Data (Video: Assessing Rater Agreement for Nominal Data) (Note: the document contains minor updates not shown in the video)
2. Readings
Note - ignore material below - not yet incorporated into EDUR 9131
Read
Instructor Note: Other material to consider
- Email explaining percentage agreement, Fleiss kappa, Krippendorff alpha
- http://www.kenbenoit.net/courses/tcd2014qta/readings/Banerjee%20et%20al%201999_Beyond%20kappa.pdf
- Krippendorff alpha details http://repository.upenn.edu/cgi/viewcontent.cgi?article=1286&context=asc_papers
- http://www.agreestat.com/research_papers/wiley_encyclopedia2008_eoct631.pdf
- Fleiss kappa http://en.wikipedia.org/wiki/Fleiss'_kappa
- Fleiss kappa 1971 introduction: http://www.bwgriffin.com/gsu/courses/edur9131/content/Fleiss_kappa_1971.pdf (shows the formula for agreement among two or more raters, as noted in the email linked above)
- Liao, Hunt, & Chen (2010). Comparison between Inter-rater Reliability and Inter-rater Agreement in Performance Assessment. Illustrates how high reliability does not imply high agreement, and how high agreement does not imply high reliability.
- They claim the data in Table 2 show high agreement but low consistency. They are incorrect; the agreement indices are very low: Krippendorff's alpha = -.29 (ordinal, interval, and ratio) and ICC for agreement = -.38 (single rater) and -5.00 (multiple raters). Moreover, the subject means in Table 2 differ by almost 2 SD, a large effect size difference.
- Artstein and Poesio (2008). Inter-Coder Agreement for Computational Linguistics. Excellent, detailed review of inter-coder agreement measures, including Krippendorff's alpha.
(c) Reliability -- inter-rater and intra-rater agreement
- Rating Scales are Categorical/Nominal (angry, fearful, contempt, disgust) or Few Ranked Categories (poor, good, very good)
- Presentation notes: Inter-rater Agreement with Nominal/Categorical Ratings with SPSS commands
- Graham, Milanowski, & Miller (2012). Measuring and Promoting Inter-Rater Agreement of Teacher and Principal Performance Ratings.
- See their Table 1 for a nice illustration of the difference between reliability and agreement: Pearson r vs. percent agreement; r can be 1.00 even when the raters never assign the same score (illustrated in the sketch below)
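- A quick numeric sketch of this point (hypothetical ratings, not Graham et al.'s data), using Python/numpy: two raters whose scores never match can still correlate perfectly.
```python
import numpy as np

# Hypothetical 1-4 performance ratings for five teachers: Rater B is always
# one point higher than Rater A, so the two raters never agree exactly.
rater_a = np.array([1, 2, 3, 1, 2])
rater_b = rater_a + 1

r = np.corrcoef(rater_a, rater_b)[0, 1]
percent_agreement = np.mean(rater_a == rater_b) * 100

print(f"Pearson r = {r:.2f}")                           # 1.00 -- perfect consistency
print(f"Percent agreement = {percent_agreement:.0f}%")  # 0%  -- no agreement at all
```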
- Percentage Agreement
- A simple count of "agreement codes" divided by "total codes" gives percentage agreement
- Percentage agreement can be calculated for more than two raters
- Percentage agreement alone is generally not recommended because it capitalizes on chance agreement and can therefore overstate the level of agreement; still, report percent agreement alongside chance-corrected indices
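- A minimal sketch of the calculation in plain Python (hypothetical codes and hypothetical helper names): percent agreement is matching codes divided by total codes, and with more than two raters one common approach is to average agreement over all rater pairs.
```python
from itertools import combinations

def percent_agreement(codes_a, codes_b):
    """Share of units on which two raters assigned the same code."""
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)

def mean_pairwise_agreement(ratings_by_rater):
    """Average percent agreement over all pairs of raters."""
    pairs = list(combinations(ratings_by_rater, 2))
    return sum(percent_agreement(a, b) for a, b in pairs) / len(pairs)

# Hypothetical nominal codes (angry/fearful/contempt/disgust) for six units
rater1 = ["angry", "fearful", "angry",    "contempt", "angry", "disgust"]
rater2 = ["angry", "fearful", "contempt", "contempt", "angry", "angry"]
rater3 = ["angry", "angry",   "angry",    "contempt", "angry", "disgust"]

print(percent_agreement(rater1, rater2))                  # 4/6 = 0.67
print(mean_pairwise_agreement([rater1, rater2, rater3]))  # mean over the 3 pairs
```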
- Cohen's Kappa
- Useful for two raters with unordered rating scale or few ordered options
- Discussion: Viera & Garrett (2005); presents a table for interpreting kappa and notes kappa's limitations
- Example of kappa use: Uiters et al. (2006); see page 4 and the accompanying table
- SPSS
- Requires symmetric rating tables: every code must be present for each rater (if Rater A uses codes 1 to 5 but Rater B uses only 1 to 4, SPSS will not compute kappa). Update: newer versions of SPSS correct this.
- Likely not useful when the number of categories is large (e.g., many themes for coding open responses); better suited to overall judgments with a limited set of codes (excellent, pass, fail)
- SPSS notes
- http://www.stattutorials.com/SPSS/TUTORIAL-SPSS-Interrater-Reliability-Kappa.htm
- http://www.sma.org.sg/smj/4412/4412bs1.pdf -- page 617
- (problem: all categories must be present for both raters): http://www.ats.ucla.edu/stat/spss/faq/kappa.htm (can three or more raters be used?)
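- A minimal sketch of the kappa computation in plain Python (hypothetical codes), handy as a cross-check on SPSS output: kappa = (Po - Pe) / (1 - Pe), where chance agreement Pe is built from each rater's marginal proportions.
```python
from collections import Counter

def cohens_kappa(codes_a, codes_b):
    n = len(codes_a)
    # Observed agreement: proportion of units with identical codes
    p_o = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Chance agreement: product of the two raters' marginal proportions,
    # summed over all categories (Cohen's definition)
    marg_a, marg_b = Counter(codes_a), Counter(codes_b)
    categories = set(codes_a) | set(codes_b)
    p_e = sum((marg_a[c] / n) * (marg_b[c] / n) for c in categories)
    return (p_o - p_e) / (1 - p_e)

rater_a = ["pass", "pass", "fail", "excellent", "pass", "fail"]
rater_b = ["pass", "fail", "fail", "excellent", "pass", "pass"]
print(round(cohens_kappa(rater_a, rater_b), 3))  # 0.455 for these codes
# sklearn.metrics.cohen_kappa_score(rater_a, rater_b) should give the same value
```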
- Scott's pi (or Fleiss Kappa for two raters)
- Uses the same formula as Cohen's kappa, (Po - Pe) / (1 - Pe), but calculates the chance (expected) agreement term differently, from pooled marginal proportions; results are often similar
- Excel file to calculate Fleiss' generalized kappa: http://www.bwgriffin.com/gsu/courses/edur9131/content/fleiss_kappa2.xls (Note: spreadsheet not working)
- Original link: http://www.ccitonline.org/jking/homepage/interrater.html (Excel files for Fleiss kappa; large excel may be problematic)
- How to calculate Fleiss kappa in Excel: http://www.real-statistics.com/reliability/fleiss-kappa/
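- Since the spreadsheet above is reported as not working, here is a minimal Python sketch of Fleiss' generalized kappa following the Fleiss (1971) formulas (hypothetical counts; treat it as a sketch, not a validated tool). If statsmodels is installed, its fleiss_kappa function in statsmodels.stats.inter_rater can serve as a cross-check.
```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa from an (n_subjects x n_categories) table of counts,
    where each row sums to the number of raters per subject."""
    counts = np.asarray(counts, dtype=float)
    n_subjects, _ = counts.shape
    n_raters = counts.sum(axis=1)[0]  # assumes the same number of raters per subject
    # Per-subject agreement P_i and overall mean agreement P-bar
    p_i = (np.sum(counts**2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement from the pooled category proportions
    p_j = counts.sum(axis=0) / (n_subjects * n_raters)
    p_e = np.sum(p_j**2)
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical example: 5 subjects, 3 raters, 3 categories (columns are categories)
table = [[3, 0, 0],
         [1, 2, 0],
         [0, 3, 0],
         [1, 1, 1],
         [0, 0, 3]]
print(round(fleiss_kappa(table), 3))  # about 0.49 for these counts
```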
- Krippendorff's alpha
- Krippendorff argues
- alpha is superior to kappa and pi
- serious work should achieve alpha > .80
- alpha can handle missing data when there are three or more raters, unlike kappa and pi
- SPSS Syntax for running Krippendorff alpha
- Hayes' website: http://www.afhayes.com/public/kalpha.sps
- Copied syntax here: SPSS syntax
- Spring 2015 students noted errors with SPSS version 21 and 22; check syntax
- Knut De Swert (2012). Calculating inter-coder reliability in media content analysis using Krippendorff’s Alpha. University of Amsterdam. Explains how to use Hayes' SPSS syntax to run Krippendorff's alpha and how to interpret its values; provides examples.
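- Outside SPSS, one way to check Krippendorff's alpha is the third-party Python krippendorff package (its use here is an assumption of this sketch; install with pip install krippendorff). The data below are hypothetical: one row per rater, one column per unit, with np.nan marking a rating a rater did not provide, since alpha tolerates missing data.
```python
import numpy as np
import krippendorff  # third-party package: pip install krippendorff

# Hypothetical nominal codes (1-4) from three raters over eight units;
# np.nan marks a unit that a rater did not code.
reliability_data = np.array([
    [1,      2, 3, 3, 2, 1, 4, 1],
    [1,      2, 3, 3, 2, 2, 4, 1],
    [np.nan, 2, 3, 3, 2, 1, 4, 2],
])

alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(round(alpha, 3))
```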
- Three or More Raters
- All of the agreement indices noted above can be extended to more than two raters
- See Inter-rater Agreement with Nominal/Categorical Ratings for examples and instructions
- On-line Reliability Calculators:
- Geertzen, J. (2012). Inter-Rater Agreement with multiple raters and variables. Retrieved February 20, 2015, from https://mlnl.net/jg/software/ira/
- Deen Freelon (2015) ReCal: reliability calculation for the masses
- Both report (a) mean percentage agreement, (b) mean kappa, (c) Scott pi or Fleiss kappa, (d) Krippendorff alpha
- Material to Add
- Index of concordance = A / (A + D), where A = number of agreements and D = number of disagreements
- Replicate Table 3 to show problem with Cohen/Fleiss Kappa: Joyce (2013) Blog Entry: Picking the Best Intercoder Reliability Statistic for Your Digital Activism Content Analysis (PDF version of page)
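- A minimal sketch of the kind of problem at issue (a hypothetical confusion matrix, not a replication of Joyce's Table 3): when one category dominates, two coders can agree on most units yet receive a near-zero or negative kappa.
```python
from collections import Counter

# Hypothetical "kappa paradox" data with heavily skewed marginals:
# two coders label 100 posts as "activism" (A) or "not activism" (N).
# They agree on 80 A's and split the remaining 20 disagreements.
rater_1 = ["A"] * 80 + ["A"] * 10 + ["N"] * 10
rater_2 = ["A"] * 80 + ["N"] * 10 + ["A"] * 10

n = len(rater_1)
p_o = sum(a == b for a, b in zip(rater_1, rater_2)) / n   # 0.80 observed agreement
m1, m2 = Counter(rater_1), Counter(rater_2)
p_e = sum((m1[c] / n) * (m2[c] / n) for c in ("A", "N"))  # 0.82 chance agreement
kappa = (p_o - p_e) / (1 - p_e)

print(f"percent agreement = {p_o:.2f}, kappa = {kappa:.2f}")
# 80% raw agreement, yet kappa is about -0.11 because "A" dominates the codes.
```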
- Rating Scales are Ordinal with Several Steps, or Interval/Ratio Scale
- Presentation notes (to be revised): Inter-rater Agreement with Ranked/Interval Data (revisions needed: ICC absolute agreement vs. consistency; absolute agreement means raters give the same score, consistency means their scores follow a similar pattern)
- Two Raters
- Agreement vs. Consistency
- Are scores the same (or nearly the same) across raters: a question of agreement
- Do scores follow the same pattern across raters, even if they differ in level: a question of consistency
- See Inter-rater Agreement with Ranked/Interval Data for illustration of this difference.
- Liao, Hunt, & Chen (2010). Comparison between Inter-rater Reliability and Inter-rater Agreement in Performance Assessment.
- p 613 "Inter-rater agreement and inter-rater reliability are both important for PA. The former shows stability of scores a student receives from different raters, while the latter shows the consistence of scores across different students from different raters."
- Not sure I agree with this statement; consistency does not mean raters' scores are close in value across cases, only that they follow the same pattern. Scores can differ widely yet be perfectly consistent.
- Update: They claim the data in Table 2 show high agreement but low consistency. They are incorrect; the agreement indices are very low: Krippendorff's alpha = -.29 (ordinal, interval, and ratio) and ICC for agreement = -.38 (single rater) and -5.00 (multiple raters). Moreover, the subject means in Table 2 differ by almost 2 SD, a large effect size difference.
- Is it possible to have high agreement but low consistency?
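- A minimal numeric sketch bearing on that question (hypothetical essay scores): when subjects barely differ, two raters can be within one point of each other on every case while the between-rater correlation is low or negative. Note that chance-corrected agreement indices (e.g., the absolute-agreement ICC) would also be low here, so the answer depends on whether "agreement" means raw closeness of scores or a chance-corrected index.
```python
import numpy as np

# Hypothetical 1-10 ratings for six essays: the raters are never more than
# one point apart, but the essays barely differ from one another, so the
# tiny disagreements dominate the between-rater correlation.
rater_a = np.array([7, 7, 8, 7, 8, 7])
rater_b = np.array([7, 8, 7, 8, 7, 7])

exact_agreement = np.mean(rater_a == rater_b)               # 2/6 exact matches
within_one = np.mean(np.abs(rater_a - rater_b) <= 1)        # 100% within one point
r = np.corrcoef(rater_a, rater_b)[0, 1]                     # -0.50, negative correlation

print(f"exact agreement = {exact_agreement:.2f}, within one point = {within_one:.2f}")
print(f"Pearson r = {r:.2f}")
```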
- Pearson r and Correlated t-test
- Can be used, but Pearson r is not a measure of agreement
- Cohen's weighted kappa for ordinal data
- gamma? others?
- Krippendorff's Alpha for ordinal, interval, and ratio data
- Hayes' website: http://www.afhayes.com/public/kalpha.sps
- Copied syntax here: SPSS syntax
- Spring 2015 students noted errors with SPSS version 21 and 22; check syntax
- Intra-class Correlation Coefficient
- Absolute Agreement (scores match) vs. Consistency (patterns match)
- Multiple (average) vs. single rater: use the average-measures ICC for studies that combine scores across raters; use the single-measure ICC for studies that use scores from a single rater (see the sketch after the notes below)
- Sources:
- Hallgren (2012). Computing Inter-Rater Reliability for Observational Data: An Overview and Tutorial. Tutor Quant Methods Psychol.
- David P. Nichols (1998) Choosing an intraclass correlation coefficient (in SPSS). Explains different model options, difference between consistency and absolute agreement, and between single measure vs. average measure.
- Yaffee (1998) Enhancement of Reliability Analysis: Application of Intraclass Correlations with SPSS/Windows v.8. Explains differences in ICC provided by SPSS and gives examples.
- Notes:
- Wuensch (2014). Inter-rater Agreement. SPSS instructions with discussion.
- Krippendorff's alpha for ordinal data provides a much better assessment than nominal measures (kappa, etc.) whenever the data have any ordering. Source: Antoine, Villaneau, & Lefeuvre (2014). Weighted Krippendorff's alpha is a more reliable metrics for multi-coders ordinal annotations: experimental studies on emotion, opinion and coreference annotation.
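- A minimal numpy sketch of the ICC distinctions above (consistency vs. absolute agreement, single vs. average measures), built from the standard two-way ANOVA mean squares; the scores are hypothetical, and SPSS or another package should be used to verify results for real data.
```python
import numpy as np

def icc_two_way(scores):
    """Two-way ICCs from an (n_subjects x k_raters) score matrix.
    Returns single- and average-measures ICCs under the consistency and
    absolute-agreement definitions: ICC(C,1), ICC(A,1), ICC(C,k), ICC(A,k)."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * np.sum((x.mean(axis=1) - grand) ** 2)     # between subjects
    ss_cols = n * np.sum((x.mean(axis=0) - grand) ** 2)     # between raters
    ss_err = np.sum((x - grand) ** 2) - ss_rows - ss_cols   # residual
    ms_r, ms_c = ss_rows / (n - 1), ss_cols / (k - 1)
    ms_e = ss_err / ((n - 1) * (k - 1))
    return {
        "ICC(C,1)": (ms_r - ms_e) / (ms_r + (k - 1) * ms_e),
        "ICC(A,1)": (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n),
        "ICC(C,k)": (ms_r - ms_e) / ms_r,
        "ICC(A,k)": (ms_r - ms_e) / (ms_r + (ms_c - ms_e) / n),
    }

# Hypothetical 1-10 scores: Rater B tracks Rater A but runs two points higher,
# so consistency is perfect while absolute agreement is clearly lower.
scores = np.array([[2, 4],
                   [4, 6],
                   [6, 8],
                   [8, 10],
                   [5, 7]])
for name, value in icc_two_way(scores).items():
    print(f"{name} = {value:.2f}")
# ICC(C,1) = 1.00, ICC(A,1) = 0.71, ICC(C,k) = 1.00, ICC(A,k) = 0.83
```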
- Three or More Raters
- Krippendorff's alpha
- Key benefit is ability to handle missing data from multiple raters
- Handles ordinal, interval, and ratio data
- Intra-class Correlation
- SPSS instructions and discussion: Wuensch (2013) The Intraclass Correlation Coefficient.
- Cronbach's alpha
- Not a measure of agreement; rather, a measure of pattern consistency
- Cronbach's alpha equals the average-measures (multiple-rater) ICC under the consistency definition, but not under absolute agreement
- Can be used as an aggregate measure of consistency across raters when the mean score across raters will be used
- Source: Hayes & Krippendorff (2007), Answering the Call for a Standard Reliability Measure for Coding Data; they explain this on p. 81
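- A minimal sketch of that equivalence (hypothetical scores): Cronbach's alpha from the usual variance formula matches the average-measures consistency ICC computed from the two-way mean squares.
```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha treating raters as 'items':
    k/(k-1) * (1 - sum of rater variances / variance of the summed scores)."""
    x = np.asarray(scores, dtype=float)
    k = x.shape[1]
    rater_vars = x.var(axis=0, ddof=1)
    total_var = x.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - rater_vars.sum() / total_var)

def icc_consistency_avg(scores):
    """Average-measures consistency ICC: (MS_subjects - MS_error) / MS_subjects."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ms_r = k * np.sum((x.mean(axis=1) - grand) ** 2) / (n - 1)
    ss_c = n * np.sum((x.mean(axis=0) - grand) ** 2)
    ss_e = np.sum((x - grand) ** 2) - (n - 1) * ms_r - ss_c
    ms_e = ss_e / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / ms_r

# Hypothetical 1-10 ratings from three raters on five subjects.
scores = np.array([[3, 5, 4],
                   [6, 7, 7],
                   [4, 6, 5],
                   [8, 9, 9],
                   [5, 5, 6]])
print(round(cronbach_alpha(scores), 4))       # 0.976
print(round(icc_consistency_avg(scores), 4))  # 0.976 -- the same value
```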
- Other Measures of Agreement/Consistency
- Many measures exist. Some discussion can be found here:
- Barnhart et al. (2014). Choice of agreement indices for assessing and improving measurement reproducibility in a core laboratory setting
- Banerjee, Capozzoli, McSweeney, and Sinha (1999). Beyond kappa: A review of interrater agreement measures. Reviews a number of agreement measures.
- Euclidean coefficients -- https://conservancy.umn.edu/bitstream/handle/11299/114459/v15n4p321.pdf?sequence=1
- Loglinear models to examine patterns of agreement.
- Latent Class models
- Bennett's sigma
- Gwet's gamma
- Aickin alpha
- Concordance correlation coefficient (CCC)