Inter-rater reliability under per-reviewer normalization

By Ryan Siegal · Founder and Principal

Published 2026-04-24

1. What per-reviewer normalization does to inter-rater disagreement

A classical decomposition of a review dataset splits observed variance into three sources: between-product variance (σ²_P), between-reviewer variance (σ²_R, also called the "reviewer main effect"), and residual variance (σ²_e).

Observed rating:   r_{u,i}  =  μ  +  α_u  +  β_i  +  ε_{u,i}

  μ        = grand mean
  α_u      = reviewer main effect    (reviewer u's personal offset)
  β_i      = product true quality    (the signal we want)
  ε_{u,i}  = residual (noise + reviewer-by-product interaction)

Variance decomposition:
  Var(r)  =  σ²_R  +  σ²_P  +  σ²_e

Per-reviewer z-score normalization subtracts each reviewer's personal mean and divides by their personal SD. That operation:

Zeros out α_u: subtracting μ_u removes the reviewer's offset.
Standardizes σ_u: dividing by σ_u forces every reviewer to unit variance.
Preserves β_i: the product-quality signal — the thing we actually care about — is the only remaining systematic effect.

What's left in the z-scored dataset is just between-product variance plus residual noise. ICC computed on the z-scored data therefore captures agreement about product quality, not agreement about scale usage.
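The normalization step described above can be sketched in a few lines of Python. This is a minimal illustration, not our production pipeline: the function name `zscore_per_reviewer`, the tuple layout, and the toy reviewers are all assumptions made for the example, and the population SD is used for σ_u.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_per_reviewer(ratings):
    """ratings: iterable of (reviewer, product, score) tuples.

    Returns {(reviewer, product): z}. Reviewers whose scores have zero
    spread are skipped, because sigma_u = 0 leaves z undefined.
    """
    by_reviewer = defaultdict(list)
    for reviewer, product, score in ratings:
        by_reviewer[reviewer].append((product, score))

    z = {}
    for reviewer, rows in by_reviewer.items():
        scores = [s for _, s in rows]
        mu_u = mean(scores)        # reviewer's personal mean
        sigma_u = pstdev(scores)   # reviewer's personal SD (population form)
        if sigma_u == 0:
            continue
        for product, score in rows:
            z[(reviewer, product)] = (score - mu_u) / sigma_u
    return z


# Three hypothetical reviewers with different offsets and scale widths
# who nonetheless rank products A < B < C identically.
ratings = [
    ("harsh", "A", 1), ("harsh", "B", 2), ("harsh", "C", 3),
    ("generous", "A", 3), ("generous", "B", 4), ("generous", "C", 5),
    ("wide", "A", 0), ("wide", "B", 5), ("wide", "C", 10),
]
z = zscore_per_reviewer(ratings)
```

In this toy case all three reviewers end up with the same z-score for each product: the offset and the scale width are gone, and only the product ordering remains.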

2. Reviewer-level ICC(1,1) in the z-scale

For z-scored data z_{u,i}, the one-way random-effects single-measurement intraclass correlation is:

ICC(1,1)  =  σ²_P  /  (σ²_P  +  σ²_e)

where σ²_P and σ²_e are estimated from the ANOVA decomposition of z_{u,i}.

Interpretation (Koo & Li, 2016):
  ICC < 0.50    →  poor reliability        (category is genuinely polarizing)
  0.50–0.75     →  moderate reliability    (typical for luxury wine, arthouse film)
  0.75–0.90     →  good reliability        (typical for consumer electronics, hotels)
  ICC ≥ 0.90    →  excellent reliability   (rare; only in very objective categories)
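As a rough sketch of how the ratio above can be estimated, the snippet below uses the standard one-way ANOVA mean squares, with products as the grouping factor. It assumes a balanced design (every product has the same number k of z-scores); the function names `icc_1_1` and `koo_li_band` and the sample table are illustrative, not part of our codebase.

```python
from statistics import mean


def icc_1_1(table):
    """table: one row per product, each row holding k z-scores.

    Assumes a balanced design (same k in every row). Estimates
    sigma2_P = (MSB - MSW) / k and sigma2_e = MSW, then returns
    sigma2_P / (sigma2_P + sigma2_e).
    """
    n = len(table)
    k = len(table[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in table):
        raise ValueError("need a balanced table with >= 2 products and >= 2 ratings each")

    grand = mean(x for row in table for x in row)
    row_means = [mean(row) for row in table]

    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(table, row_means) for x in row
    ) / (n * (k - 1))

    var_p = (ms_between - ms_within) / k   # between-product variance estimate
    var_e = ms_within                      # residual variance estimate
    if var_p + var_e == 0:
        raise ValueError("all ratings identical; ICC is undefined")
    return var_p / (var_p + var_e)


def koo_li_band(icc):
    """Map an ICC to the Koo & Li (2016) labels used in the table above."""
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc < 0.90:
        return "good"
    return "excellent"


# Hypothetical z-scores: 3 products, 3 reviewers who largely agree.
table = [
    [-1.2, -1.0, -1.1],
    [0.0, 0.2, -0.1],
    [1.1, 0.9, 1.2],
]
icc = icc_1_1(table)
band = koo_li_band(icc)
```

Note that the estimate can come out negative when reviewers disagree more within a product than products differ from each other; the sketch returns that value unchanged rather than clipping it to zero.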

3. What ICC tells us about a category

Typical reviewer-level ICC(1,1) magnitudes observed in published analyses of z-normalized review data.
Luxury wine (Napa Cab, Bordeaux classed growths)	ICC ≈ 0.55–0.65. Moderate. Reviewers agree broadly on the top and bottom but disagree substantially on the middle — which is exactly where stylistic palate differences live.
Mass-market consumer wine	ICC ≈ 0.70. Higher than luxury because stylistic nuance matters less; the question is essentially "is this drinkable?"
Arthouse film	ICC ≈ 0.40–0.55. Poor to moderate. Critics frequently disagree on artistic merit; the category is genuinely polarizing.
Consumer electronics (laptops, headphones)	ICC ≈ 0.75–0.85. Good. Objective criteria (battery life, sound quality under standardized tests) drive most of the variance.
Hotels (luxury tier)	ICC ≈ 0.80. Good. Professional editorial panels like Michelin and Forbes apply similar rigor.
Restaurants (casual tier)	ICC ≈ 0.65. Moderate. Crowd reviewers and professional critics agree less; stylistic cuisine preferences create systematic disagreement.

These ICCs are computed on our cross-reviewer z-scored data, which is why they differ from published source-pair ICCs (the older approach that compared Wine Spectator and Wine Advocate source means directly). Reviewer-level ICCs are lower than source-level ICCs because per-reviewer normalization exposes individual-critic disagreement that source-level averaging hides.

4. Why we publish R1, R2, and R3 — it's an ICC response

When reviewer-level ICC is high, R1 (pure relative) is a reliable single number. When ICC is low, R1 still provides the crowd's consensus but misses the possibility that the professionals are right and the crowd is wrong. R2 (source-weighted) tilts toward more credible sources — useful specifically in low-ICC regimes where the disagreement is informative.

In practice, we watch for products where R1 and R2 diverge by more than ~10 percentile points. That's a signal to the reader: reviewers disagree systematically, the product is stylistically polarizing, and the decision depends on whose palate aligns with yours. Our taglines surface this directly ("Professional critics rate this far above the crowd consensus").
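The divergence check is simple enough to write down directly. The sketch below is illustrative only: the function name and the exact 10.0 default are assumptions based on the "~10 percentile points" rule of thumb above, and the tagline for the opposite direction is hypothetical wording, not one of our published taglines.

```python
def divergence_tagline(r1_pct, r2_pct, threshold=10.0):
    """Return a reader-facing note when R1 and R2 percentiles diverge, else None.

    r1_pct: pure-relative percentile (crowd consensus), 0-100.
    r2_pct: source-weighted percentile, 0-100.
    """
    gap = r2_pct - r1_pct
    if abs(gap) <= threshold:
        return None
    if gap > 0:
        return "Professional critics rate this far above the crowd consensus"
    # Hypothetical wording for the opposite direction (not from the source text).
    return "The crowd consensus rates this far above professional critics"


note = divergence_tagline(40.0, 70.0)
```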

5. Cohen's κ for binary rater decisions

Some reviewers produce only binary decisions (Rotten Tomatoes fresh/rotten, Michelin star/no-star). Z-score normalization doesn't directly apply to a single binary observation. For those reviewers we use pairwise Cohen's κ to compute agreement on shared products:

κ  =  (p_o − p_e) / (1 − p_e)

  p_o = observed agreement rate on binary decisions
  p_e = agreement rate expected by chance

Interpretation (Landis & Koch, 1977):
  κ < 0.00     →  poor
  0.00 – 0.20  →  slight
  0.21 – 0.40  →  fair
  0.41 – 0.60  →  moderate
  0.61 – 0.80  →  substantial
  0.81 – 1.00  →  almost perfect
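For concreteness, here is a minimal sketch of pairwise κ for two binary reviewers, with p_e computed from each reviewer's own marginal rate. The function names and the two decision vectors are made up for illustration.

```python
def cohens_kappa(a, b):
    """a, b: equal-length sequences of 0/1 decisions on the same products."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n   # observed agreement
    pa = sum(a) / n                               # reviewer A's positive rate
    pb = sum(b) / n                               # reviewer B's positive rate
    p_e = pa * pb + (1 - pa) * (1 - pb)           # chance agreement
    if p_e == 1:
        raise ValueError("both reviewers are constant; kappa is undefined")
    return (p_o - p_e) / (1 - p_e)


def landis_koch_band(kappa):
    """Map kappa to the Landis & Koch (1977) labels listed above."""
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical fresh (1) / rotten (0) calls on ten shared films.
critic_a = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
critic_b = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0]
kappa = cohens_kappa(critic_a, critic_b)
label = landis_koch_band(kappa)
```

Here the two critics agree on 8 of 10 films (p_o = 0.8), but because both say "fresh" 60% of the time, chance alone would produce p_e = 0.52, so κ lands well below the raw agreement rate.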

Binary reviewers get admitted to R1/R2 only when paired with at least two other binary decisions (so that, provided the decisions are not all identical, they have a personal mean strictly between 0 and 1 and a non-zero σ). Their z-scores take only two values, roughly ±1 for a reviewer whose decisions split about evenly, which implicitly down-weights them relative to continuous-scale reviewers. That is a feature, not a bug.

6. How low reviewer-level ICC calibrates R2 weights

R2's source weights are not arbitrary. Our calibration loop measures how each source changes the reviewer-level ICC when it is added to the pool. Sources that raise ICC (their reviewers agree with the broader consensus once normalized) earn higher w_s. Sources that lower ICC (their reviewers introduce independent signal) get a more nuanced treatment: they contribute information orthogonal to the pool, but at the cost of agreement, so we weight them lower than their raw rigor would suggest.

This loop — ICC-informed weight calibration — runs quarterly. Any weight change produces a public methodology version bump in the changelog. Historical percentiles remain visible at their original weights.
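One way to picture "the contribution of each source to ICC" is a leave-one-source-out comparison: compute ICC on the full pool, recompute it without a given source's reviewers, and look at the difference. The sketch below does exactly that under strong simplifying assumptions (every reviewer has rated every product, and the data are already z-scored). It is an illustration of the idea, not a description of our actual calibration code; the function names, the source labels X/Y/Z, and the numbers are all hypothetical, and the mapping from these deltas to w_s is not shown.

```python
from statistics import mean


def icc_1_1(table):
    """One-way ICC(1,1) on a balanced table (rows = products, cols = ratings)."""
    n, k = len(table), len(table[0])
    grand = mean(x for row in table for x in row)
    row_means = [mean(row) for row in table]
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    msw = sum((x - m) ** 2 for row, m in zip(table, row_means) for x in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


def icc_contribution(z_by_reviewer, source_of):
    """Return {source: ICC(all reviewers) - ICC(all reviewers except that source)}.

    z_by_reviewer: {reviewer: [z for product 0, product 1, ...]}
    source_of:     {reviewer: source label}
    A positive value means the source raises pool ICC; negative means it lowers it.
    """
    def table_for(reviewers):
        columns = [z_by_reviewer[r] for r in reviewers]
        return [list(row) for row in zip(*columns)]   # transpose to product rows

    everyone = sorted(z_by_reviewer)
    icc_all = icc_1_1(table_for(everyone))

    deltas = {}
    for source in sorted(set(source_of.values())):
        rest = [r for r in everyone if source_of[r] != source]
        if len(rest) < 2:          # ICC needs at least two ratings per product
            continue
        deltas[source] = icc_all - icc_1_1(table_for(rest))
    return deltas


# Hypothetical, approximately z-scored ratings on four products.
z_by_reviewer = {
    "x1": [-1.3, -0.4, 0.4, 1.3],
    "x2": [-1.2, -0.5, 0.5, 1.2],
    "y1": [-1.3, -0.3, 0.3, 1.3],
    "z1": [1.0, -1.0, 1.0, -1.0],   # a contrarian reviewer
}
source_of = {"x1": "X", "x2": "X", "y1": "Y", "z1": "Z"}
deltas = icc_contribution(z_by_reviewer, source_of)
```

In this toy pool, sources X and Y come out with positive deltas (they raise ICC) while the contrarian source Z comes out negative.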

Frequently asked questions

Doesn't per-reviewer normalization hide disagreement?

No — it hides scale-disagreement while exposing quality-disagreement. Two reviewers who use different 1–5 scales but agree on which products are best will produce identical z-scores. Two reviewers who genuinely disagree about a specific product produce different z-scores for that product, and their disagreement shows up directly as residual variance. ICC computed on the z-scored data captures exactly this quality disagreement.

Why is reviewer-level ICC lower than source-level ICC?

Source-level ICC averages over many critics at each source, which cancels out individual-critic disagreement. Reviewer-level ICC exposes that individual disagreement. The source-level numbers look better, but they overstate agreement because they hide the intra-source variance.

How do we handle binary reviewers like Rotten Tomatoes critics?

They're admitted to R1/R2 only if they've made at least two binary decisions (so μ_u and σ_u are defined). Their z-scores saturate at roughly ±1, which gives them less leverage than continuous-scale reviewers. For products with only binary data, the aggregate effectively becomes a Wilson-score-adjusted fresh rate.
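As an illustration of what a Wilson-score adjustment does to a fresh rate, the sketch below computes the lower bound of the Wilson score interval. The choice of the lower bound and of z = 1.96 (a 95% interval) are assumptions made for the example; the answer above does not specify which form of the adjustment is used.

```python
from math import sqrt


def wilson_lower_bound(fresh, total, z=1.96):
    """Lower bound of the Wilson score interval for a fresh rate.

    fresh: number of positive (fresh) decisions
    total: number of binary decisions
    z:     normal quantile; 1.96 is an assumed 95% level
    """
    if total == 0:
        return 0.0
    p = fresh / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / denom


small_sample = wilson_lower_bound(9, 10)     # 90% fresh on 10 reviews
large_sample = wilson_lower_bound(90, 100)   # 90% fresh on 100 reviews
```

Both products have the same raw 90% fresh rate, but the adjusted value is noticeably lower for the 10-review product, which is the point of the adjustment: small samples are pulled away from the extremes.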

Do ICCs update over time?

Yes. As the reviewer pool grows, ICC estimates stabilize. Quarterly we recompute category-level ICCs and adjust R2 source weights based on how each source contributes to inter-reviewer agreement. Historical percentiles are preserved at their computation-date weights.

What if a category has ICC near zero — should we still publish percentiles?

We publish them, but we flag the category as "low inter-rater reliability" on the category page and give R2 extra visual weight. Near-zero ICC means reviewers functionally disagree; the best we can do is report the crowd consensus (R1) and the credible-source consensus (R2) separately and let the reader choose.
