Inter-rater reliability under per-reviewer normalization
By Ryan Siegal · Founder and Principal
1. What per-reviewer normalization does to inter-rater disagreement
A classical decomposition of a review dataset splits observed variance into three sources: between-product variance (σ²_P), between-reviewer variance (σ²_R, the variance of the "reviewer main effect"), and residual variance (σ²_e).
Observed rating: r_{u,i} = μ + α_u + β_i + ε_{u,i}
μ = grand mean
α_u = reviewer main effect (reviewer u's personal offset)
β_i = product true quality (the signal we want)
ε_{u,i} = residual (noise + reviewer-by-product interaction)
Variance decomposition:
Var(r) = σ²_R + σ²_P + σ²_e
Per-reviewer z-score normalization subtracts each reviewer's personal mean μ_u and divides by their personal SD σ_u. That operation:
- Zeros out α_u: subtracting μ_u removes the reviewer's offset.
- Standardizes σ_u: dividing by σ_u forces every reviewer to unit variance.
- Preserves β_i: the product-quality signal, which is the thing we actually care about, is the only remaining systematic effect.
What's left in the z-scored dataset is just between-product variance plus residual noise. ICC computed on the z-scored data therefore captures agreement about product quality, not agreement about scale usage.
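The normalization step can be sketched in a few lines of Python. This is a minimal illustration, not our production code: the function name `zscore_per_reviewer`, the tuple layout, and the choice of population SD (rather than sample SD) are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_per_reviewer(ratings):
    """Z-score each rating against its own reviewer's mean and SD.

    ratings: list of (reviewer, product, score) tuples.
    Returns a list of (reviewer, product, z) tuples.
    Assumption: population SD is used; reviewers with zero spread are skipped.
    """
    by_reviewer = defaultdict(list)
    for reviewer, _product, score in ratings:
        by_reviewer[reviewer].append(score)

    normalized = []
    for reviewer, product, score in ratings:
        scores = by_reviewer[reviewer]
        mu_u = mean(scores)        # reviewer's personal mean
        sigma_u = pstdev(scores)   # reviewer's personal SD
        if sigma_u == 0:
            continue  # a reviewer with no spread cannot be standardized
        normalized.append((reviewer, product, (score - mu_u) / sigma_u))
    return normalized


# Two hypothetical reviewers with the same ordering but different offsets.
ratings = [
    ("harsh", "A", 2.0), ("harsh", "B", 3.0), ("harsh", "C", 4.0),
    ("generous", "A", 7.0), ("generous", "B", 8.0), ("generous", "C", 9.0),
]
z = zscore_per_reviewer(ratings)
```

After normalization the "harsh" and "generous" reviewers produce the same z-scores for products A, B, and C, which is the sense in which the reviewer offset α_u has been removed.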
2. Reviewer-level ICC(1,1) in the z-scale
For z-scored data z_{u,i}, the one-way random-effects single-measurement intraclass correlation is:
ICC(1,1) = σ²_P / (σ²_P + σ²_e)
where σ²_P and σ²_e are estimated from the ANOVA decomposition of z_{u,i}.
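As a rough sketch of that estimation, the following computes ICC(1,1) from a one-way ANOVA on a balanced table (every product rated by the same number of reviewers). The function name `icc_1_1`, the table layout, and the example values are assumptions for illustration; real review data is usually unbalanced and needs a more careful estimator.

```python
def icc_1_1(table):
    """One-way random-effects, single-measurement ICC.

    table: list of rows, one row per product, each holding k z-scored
    ratings (balanced design assumed). The estimate can come out negative
    when within-product spread exceeds between-product spread.
    """
    n = len(table)       # number of products
    k = len(table[0])    # ratings per product
    grand = sum(sum(row) for row in table) / (n * k)
    row_means = [sum(row) / k for row in table]

    ss_between = k * sum((m - grand) ** 2 for m in row_means)
    ss_within = sum((x - m) ** 2
                    for row, m in zip(table, row_means) for x in row)

    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))

    var_product = (ms_between - ms_within) / k   # estimate of sigma^2_P
    var_resid = ms_within                        # estimate of sigma^2_e
    return var_product / (var_product + var_resid)


# Hypothetical z-scores: three products, three reviewers each.
example = [
    [-1.2, -1.0, -1.1],
    [0.1, -0.2, 0.0],
    [1.1, 1.2, 1.1],
]
icc = icc_1_1(example)
```

Here the reviewers rank the three products almost identically, so the estimate lands in the "excellent" band; noisier rows would pull it down.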
Interpretation (Koo & Li, 2016):
ICC < 0.50 → poor reliability (category is genuinely polarizing)
0.50–0.75 → moderate reliability (typical for luxury wine, arthouse film)
0.75–0.90 → good reliability (typical for consumer electronics, hotels)
ICC ≥ 0.90 → excellent reliability (rare; only in very objective categories)
3. What ICC tells us about a category
| Category | Reviewer-level ICC and interpretation |
|---|---|
| Luxury wine (Napa Cab, Bordeaux classed growths) | ICC ≈ 0.55–0.65. Moderate. Reviewers agree broadly on the top and bottom but disagree substantially on the middle, which is exactly where stylistic palate differences live. |
| Mass-market consumer wine | ICC ≈ 0.70. Higher than luxury because stylistic nuance matters less; the question is essentially "is this drinkable?" |
| Arthouse film | ICC ≈ 0.40–0.55. Poor to moderate. Critics frequently disagree on artistic merit; the category is genuinely polarizing. |
| Consumer electronics (laptops, headphones) | ICC ≈ 0.75–0.85. Good. Objective criteria (battery life, sound quality under standardized tests) drive most of the variance. |
| Hotels (luxury tier) | ICC ≈ 0.80. Good. Professional editorial panels like Michelin and Forbes apply similar rigor. |
| Restaurants (casual tier) | ICC ≈ 0.65. Moderate. Crowd reviewers and professional critics agree less; stylistic cuisine preferences create systematic disagreement. |
These ICCs are computed on our cross-reviewer z-scored data, which is why they differ from published source-pair ICCs (the older approach that compared Wine Spectator and Wine Advocate source means directly). Reviewer-level ICCs are lower than source-level ICCs because per-reviewer normalization exposes individual-critic disagreement that source-level averaging hides.
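One standard way to see why averaging raises ICC is the Spearman-Brown relation between single-rater reliability and the reliability of a mean of k raters. The worked line below treats a source mean as an average of k exchangeable reviewers, which is an idealization; k = 5 and the 0.60 starting value are illustrative, not measured.

```latex
\mathrm{ICC}(1,k) = \frac{k\,\mathrm{ICC}(1,1)}{1 + (k-1)\,\mathrm{ICC}(1,1)},
\qquad
\mathrm{ICC}(1,5)\Big|_{\mathrm{ICC}(1,1)=0.60}
  = \frac{5 \times 0.60}{1 + 4 \times 0.60}
  = \frac{3.0}{3.4} \approx 0.88
```

Under those assumptions, a reviewer-level ICC in the "moderate" band corresponds to a source-mean ICC in the "good" band, even though the underlying disagreement between individual critics is unchanged.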
4. Why we publish R1, R2, and R3 — it's an ICC response
When reviewer-level ICC is high, R1 (pure relative) is a reliable single number. When ICC is low, R1 still provides the crowd's consensus but misses the possibility that the professionals are right and the crowd is wrong. R2 (source-weighted) tilts toward more credible sources — useful specifically in low-ICC regimes where the disagreement is informative.
In practice, we watch for products where R1 and R2 diverge by more than ~10 percentile points. That's a signal to the reader: reviewers disagree systematically, the product is stylistically polarizing, and the decision depends on whose palate aligns with yours. Our taglines surface this directly ("Professional critics rate this far above the crowd consensus").
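The divergence check can be sketched as follows. The function name `flag_divergent`, the dictionary layout, the product names, and the percentile values are all hypothetical, and the "~10" threshold is taken literally as 10.0.

```python
DIVERGENCE_THRESHOLD = 10.0  # percentile points; "~10" read as exactly 10


def flag_divergent(products, threshold=DIVERGENCE_THRESHOLD):
    """Return products whose R1 and R2 percentiles differ by more than threshold.

    products: dict mapping product name -> (r1_percentile, r2_percentile).
    The returned value is the signed gap R2 - R1.
    """
    flagged = {}
    for name, (r1, r2) in products.items():
        gap = r2 - r1
        if abs(gap) > threshold:
            flagged[name] = gap
    return flagged


# Hypothetical percentiles for three products.
products = {
    "Wine A": (62.0, 78.0),
    "Laptop B": (81.0, 84.0),
    "Film C": (55.0, 41.0),
}
flagged = flag_divergent(products)
```

Keeping the sign of the gap lets a tagline distinguish products where the source-weighted score sits above R1 from those where it sits below.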
5. Cohen's κ for binary rater decisions
Some reviewers produce only binary decisions (Rotten Tomatoes fresh/rotten, Michelin star/no-star). Z-score normalization doesn't directly apply to a single binary observation. For those reviewers we use pairwise Cohen's κ to compute agreement on shared products:
κ = (p_o − p_e) / (1 − p_e)
p_o = observed agreement rate on binary decisions
p_e = agreement rate expected by chance
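A minimal sketch of the pairwise computation, assuming decisions are coded as 0/1 and both reviewers are restricted to the products they share. The function name `cohens_kappa` and the example decisions are illustrative.

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two reviewers' binary (0/1) decisions.

    a, b: equal-length lists of decisions on the same shared products.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty decision lists")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n   # observed agreement
    pa = sum(a) / n                               # reviewer a's rate of 1s
    pb = sum(b) / n                               # reviewer b's rate of 1s
    p_e = pa * pb + (1 - pa) * (1 - pb)           # chance agreement
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical fresh(1)/rotten(0) calls on ten shared films.
critic_a = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
critic_b = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0]
kappa = cohens_kappa(critic_a, critic_b)
```

In this example the critics agree on 8 of 10 films (p_o = 0.80) with p_e = 0.52, giving κ ≈ 0.58.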
Interpretation (Landis & Koch, 1977):
κ < 0.00 → poor
0.00 – 0.20 → slight
0.21 – 0.40 → fair
0.41 – 0.60 → moderate
0.61 – 0.80 → substantial
0.81 – 1.00 → almost perfect
Binary reviewers get admitted to R1/R2 only when paired with at least two other binary decisions (so they have a personal "mean" between 0 and 1 and a non-zero σ, which requires both outcomes to appear). Their z-scores take only two values (exactly ±1 when a reviewer's decisions split evenly), which implicitly down-weights them relative to continuous-scale reviewers. We consider that a feature, not a bug.
6. How low reviewer-level ICC calibrates R2 weights
R2's source weights are not pulled from nothing. Our calibration loop looks at the contribution of each source to the reviewer-level ICC when it's added to the pool. Sources that raise ICC (their reviewers agree with the broader consensus once normalized) earn a higher w_s. Sources that lower ICC (their reviewers introduce independent signal) get a nuanced treatment: they contribute information orthogonal to the pool, but at the cost of agreement. We weight them lower than their raw rigor would suggest.
This loop — ICC-informed weight calibration — runs quarterly. Any weight change produces a public methodology version bump in the changelog. Historical percentiles remain visible at their original weights.
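The "contribution of each source to ICC" can be illustrated with a leave-one-source-out comparison. This is only a sketch under strong assumptions: a balanced design where every reviewer rates every product, made-up values on a roughly z-like scale, and hypothetical names (`icc_1_1`, `pooled_table`, `icc_contribution`). How a contribution maps to a weight w_s is not specified here, so the sketch stops at the ICC delta.

```python
def icc_1_1(table):
    """One-way random-effects ICC(1,1) on a balanced products-by-raters table."""
    n, k = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n * k)
    means = [sum(row) / k for row in table]
    ms_between = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    ms_within = sum((x - m) ** 2
                    for row, m in zip(table, means) for x in row) / (n * (k - 1))
    var_product = (ms_between - ms_within) / k
    return var_product / (var_product + ms_within)


def pooled_table(sources, names):
    """Stack the reviewer columns of the named sources into product rows."""
    columns = [col for name in names for col in sources[name]]
    return [list(row) for row in zip(*columns)]


def icc_contribution(sources):
    """ICC of the full pool minus ICC of the pool without each source."""
    full = icc_1_1(pooled_table(sources, list(sources)))
    return {
        name: full - icc_1_1(pooled_table(sources, [s for s in sources if s != name]))
        for name in sources
    }


# Hypothetical sources; each inner list is one reviewer's scores on four products.
sources = {
    "A": [[-1.2, -0.4, 0.4, 1.2], [-1.1, -0.5, 0.5, 1.1]],
    "B": [[-1.3, -0.3, 0.3, 1.3]],
    "C": [[1.0, -1.0, 1.0, -1.0]],  # a contrarian reviewer
}
deltas = icc_contribution(sources)
```

A positive delta means the pool agrees more with the source included; in this toy data the contrarian source "C" has a negative delta while "A" and "B" are positive.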
Frequently asked questions
Doesn't per-reviewer normalization hide disagreement?
Why is reviewer-level ICC lower than source-level ICC?
How do we handle binary reviewers like Rotten Tomatoes critics?
Do ICCs update over time?
What if a category has ICC near zero — should we still publish percentiles?