Simpson's paradox in product reviews — when category aggregation flips the answer
By Ryan Siegal · Founder and Principal
The fallacy, in one paragraph
The classic textbook example is the 1973 UC Berkeley graduate-admissions case. Across all applicants, men were admitted at a higher rate than women. Within most individual departments, women were admitted at about the same rate as men or a higher one. Both statements were true: the aggregated and disaggregated comparisons answered different questions, because women applied disproportionately to departments with low admission rates. Edward Simpson described the phenomenon in a 1951 paper; the name "Simpson's paradox" was attached to it in the 1970s.
The conclusion will appear paradoxical only if we have a strong prior expectation that the conditional and marginal proportions ought to agree.
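The reversal is easier to see with numbers. The sketch below uses invented counts for two hypothetical departments (they are not the Berkeley figures) and the helper names are ours; it only shows how per-group and pooled rates can point in opposite directions.

```python
# Hypothetical (applicants, admitted) counts, invented for illustration only.
admissions = {
    "Dept A": {"men": (800, 480), "women": (100, 70)},
    "Dept B": {"men": (200, 20), "women": (900, 135)},
}


def rate(applicants, admitted):
    """Admission rate as a fraction between 0 and 1."""
    return admitted / applicants


def per_department_rates(data):
    """Admission rate for each group inside each department."""
    return {
        dept: {group: rate(*counts) for group, counts in groups.items()}
        for dept, groups in data.items()
    }


def aggregate_rates(data):
    """Admission rate for each group after pooling all departments."""
    totals = {}
    for groups in data.values():
        for group, (applicants, admitted) in groups.items():
            pooled_applicants, pooled_admitted = totals.get(group, (0, 0))
            totals[group] = (pooled_applicants + applicants, pooled_admitted + admitted)
    return {group: rate(*counts) for group, counts in totals.items()}


by_dept = per_department_rates(admissions)
overall = aggregate_rates(admissions)
print(by_dept)
print(overall)
```

In this toy data women have the higher rate in each department, yet men have the higher pooled rate, because most women applied to the department that admits few people.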
What it looks like in product reviews
Suppose we are ranking 2019 French white Burgundies. There are three reasonable comparison groups for a single bottle:
- The narrow cohort: 2019 French white Burgundy at $20–$30, n = 147.
- The broad cohort: all 2019 white Burgundy at any price, n = 612.
- The full population: every wine in the database, n = 10,557.
A bottle can be 91st-percentile in the narrow cohort, 73rd in the broad cohort, and 62nd in the full population. All three numbers are true. They answer different questions: how does this wine compare to its direct price-tier peers, how does it compare to all 2019 white Burgundy, and how does it compare to every wine in the database. The answers diverge because the three reference populations have different score distributions: the same score sits higher in a population whose members score lower on average. Ranking near the top of a price tier is a different signal from ranking near the top across all wine.
A worked example
Wine: 2019 Domaine Foo Meursault Premier Cru, $25
R1 mean z-score: +1.42 (35 reviewers)
R1 90% CI-floor: +1.42 − 1.645 · (1/√35) = +1.42 − 0.278 = +1.14
Same CI-floor, three reference populations:
Narrow cohort (2019 white Burgundy $20–$30, n = 147)
→ CI-floor +1.14 is at the 91st percentile of the cohort
→ Tagline: "Best-in-price-tier among 2019 white Burgundy"
Broad cohort (all 2019 white Burgundy, n = 612)
→ CI-floor +1.14 is at the 73rd percentile
→ Tagline: "Strong but not exceptional within all 2019 Burgundy"
Full population (all wines, n = 10,557)
→ CI-floor +1.14 is at the 62nd percentile
→ Tagline: "Above average globally; price-class makes the difference"
Three percentiles, one CI-floor. Simpson's paradox would only kick in operationally if a methodology forced you to pick one of these as the score and discarded the others. That is what most major review aggregators do today: they report a single percentile (or a single star average) without the comparison group it was computed against.
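The arithmetic in the worked example can be sketched in a few lines. This is an illustration under stated assumptions, not Rankquant's actual code: the function names are ours, the standard deviation of reviewer z-scores is taken as 1 (which is what the 1/√35 term implies), the percentile is a plain empirical-CDF rank, and the two reference lists are invented toy data.

```python
import math
from bisect import bisect_right


def ci_floor(mean_z, n_reviewers, z_crit=1.645, sd=1.0):
    """Mean minus z_crit standard errors; sd=1 is an assumption, as in the example."""
    return mean_z - z_crit * sd / math.sqrt(n_reviewers)


def percentile_rank(score, reference_scores):
    """Percentage of reference scores at or below `score` (empirical CDF)."""
    ordered = sorted(reference_scores)
    return 100.0 * bisect_right(ordered, score) / len(ordered)


floor = ci_floor(1.42, 35)

# Invented CI-floors standing in for two reference populations.
toy_narrow_cohort = [-0.4, -0.1, 0.1, 0.3, 0.5, 0.7, 0.8, 0.9, 1.0, 1.3]
toy_full_population = [-0.5, 0.0, 0.4, 0.8, 1.0, 1.1, 1.2, 1.4, 1.6, 1.9]

print(round(floor, 2))
print(percentile_rank(floor, toy_narrow_cohort))
print(percentile_rank(floor, toy_full_population))
```

With these toy lists the same floor lands at the 90th percentile of the narrow list and the 60th of the full list. Only the reference population changes between the two calls; the score being ranked does not.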
What Rankquant does about it
Two structural defences:
- Cohort percentiles are a re-ranking of the same CI-floor. A bottle's R1 global percentile and R1 cohort percentile come from the same underlying number (the floor of the 90% confidence interval on its mean z-score, that is, the mean minus 1.645 standard errors) ranked against two different reference populations. They cannot disagree about the underlying quality; they can only disagree about which population we're comparing against. That single design choice eliminates a whole class of paradoxes where competing methodologies produce competing quality estimates.
- We always show both percentiles. Every product page surfaces R1 global and R1 cohort side-by-side, with a tagline that interprets the gap. A wine that is global 62 / cohort 91 is labelled differently from a wine that is global 91 / cohort 62. Both numbers are visible to humans and to AI engines via the schema.org/Product additionalProperty array.
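As a rough illustration of the machine-readable side, the two percentiles could be carried as schema.org PropertyValue entries in a Product's additionalProperty array. The property names below ("R1 global percentile", "R1 cohort percentile") and the function name are assumptions made for this sketch; the markup Rankquant actually emits may differ.

```python
import json


def product_jsonld(name, global_pct, cohort_pct):
    """Build a schema.org Product dict carrying both percentiles (hypothetical names)."""
    return {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "additionalProperty": [
            {
                "@type": "PropertyValue",
                "name": "R1 global percentile",  # assumed label
                "value": global_pct,
            },
            {
                "@type": "PropertyValue",
                "name": "R1 cohort percentile",  # assumed label
                "value": cohort_pct,
            },
        ],
    }


doc = product_jsonld("2019 Domaine Foo Meursault Premier Cru", 62, 91)
print(json.dumps(doc, indent=2))
```

Keeping both values in the same array means a consumer never sees one percentile without the other.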
| View | What it means |
|---|---|
| Cohort percentile | Same CI-floor re-ranked among same-category peers within ±20% of the price. Answers: "How good is this product among its direct alternatives?" |
| Global percentile | Same CI-floor re-ranked among the full database. Answers: "How good is this product compared to the full universe of products we cover?" |
| When the two diverge | Cohort > global by ≥15 percentile points: "best-in-price-class". Global > cohort by ≥15: "strong overall but ordinary within its category". The tagline engine surfaces this gap automatically. |
| When the two agree | Cohort and global within 10 points of each other: the product is roughly the same rank in its peer set as in the full database. We surface this as "ranks consistently across views". |
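The thresholds in the table translate into a small decision function. This is a sketch, not the tagline engine itself: the function name is ours, "within 10 points" is read as inclusive, and gaps between 10 and 15 points return nothing because the table does not say how they are labelled.

```python
from typing import Optional


def gap_tagline(cohort_pct: float, global_pct: float) -> Optional[str]:
    """Map the cohort/global percentile gap to the tagline wording in the table."""
    gap = cohort_pct - global_pct
    if gap >= 15:
        return "best-in-price-class"
    if gap <= -15:
        return "strong overall but ordinary within its category"
    if abs(gap) <= 10:  # assumption: "within 10 points" includes exactly 10
        return "ranks consistently across views"
    return None  # gaps strictly between 10 and 15 are not specified in the text


print(gap_tagline(91, 62))
print(gap_tagline(62, 91))
print(gap_tagline(70, 65))
print(gap_tagline(80, 68))
```

The last call falls in the unspecified band, so the sketch declines to label it instead of guessing.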
The reverse failure mode
Most discussions of the paradox warn about category-level aggregation hiding sub-group effects. The reverse is also a real failure mode: over-narrow cohorting can elevate a mediocre product to #1 in a peer set so small that the percentile is statistically meaningless. A wine that is global 50 / cohort 100 is suspicious if the cohort has only n = 8 members, because the "100" says more about cohort sparsity than about the wine.
Rankquant's defence here is the cohort-size threshold: a cohort percentile is suppressed and labelled "cohort too thin to publish" if the cohort has fewer than 30 members. The number 30 isn't arbitrary — it's where the empirical CDF used to convert CI-floors to percentiles becomes statistically stable. Below that we don't pretend to know the cohort distribution.
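The suppression rule can be sketched as a guard in front of the percentile calculation. The names below are hypothetical and the percentile is a plain empirical-CDF rank; only the threshold of 30 and the label text come from the description above.

```python
from bisect import bisect_right
from typing import Sequence, Union

MIN_COHORT_SIZE = 30
SUPPRESSED_LABEL = "cohort too thin to publish"


def published_cohort_percentile(
    score: float, cohort_scores: Sequence[float]
) -> Union[float, str]:
    """Return the cohort percentile, or the suppression label for thin cohorts."""
    if len(cohort_scores) < MIN_COHORT_SIZE:
        return SUPPRESSED_LABEL
    ordered = sorted(cohort_scores)
    return 100.0 * bisect_right(ordered, score) / len(ordered)


thin_cohort = [0.1] * 8                    # n = 8, below the threshold
thick_cohort = [i / 10 for i in range(30)]  # n = 30, invented scores 0.0 to 2.9

print(published_cohort_percentile(1.0, thin_cohort))
print(published_cohort_percentile(1.45, thick_cohort))
```

Returning a label instead of a number keeps a thin-cohort "100" from ever reaching the page.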
Where editorial judgment still matters
Cohort definitions themselves are an editorial choice. We define the wine cohort as same-category × ±20% price; we define the movie cohort as same-decade × same-genre; and so on. Different cohort definitions would produce different percentiles, and a methodology operator who wanted to elevate a particular product could in principle narrow its cohort until it ranks well. We defend against this two ways:
- The cohort definition for each category is published once, on the methodology page, and version-bumped publicly when it changes.
- The cohort definition is the same for every product in a category — a single product cannot be assigned a custom cohort.
The version-bump rule is a pre-commitment: if we change the wine cohort from ±20% price to ±15%, every wine's cohort percentile updates simultaneously, with a dated note in the changelog. We can't cherry-pick.
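One way to picture the pre-commitment is a single immutable, versioned rule applied to every product in a category. The class names, fields, and version string below are hypothetical, and the catalog is invented; the sketch only shows that the peer set is a function of the rule and the product, with no per-product override.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Product:
    name: str
    category: str
    price: float


@dataclass(frozen=True)
class PriceBandCohortRule:
    """Same-category, +/- price_tolerance cohort rule (hypothetical structure)."""
    version: str
    price_tolerance: float

    def peers(self, anchor: Product, catalog: List[Product]) -> List[Product]:
        low = anchor.price * (1 - self.price_tolerance)
        high = anchor.price * (1 + self.price_tolerance)
        return [
            p for p in catalog
            if p.category == anchor.category and low <= p.price <= high
        ]


WINE_RULE = PriceBandCohortRule(version="wine-v1", price_tolerance=0.20)

# Invented catalog entries for illustration.
anchor = Product("Domaine Foo Meursault", "2019 white Burgundy", 25.0)
catalog = [
    anchor,
    Product("Peer A", "2019 white Burgundy", 22.0),
    Product("Peer B", "2019 white Burgundy", 28.0),
    Product("Too expensive", "2019 white Burgundy", 35.0),
    Product("Other category", "2019 red Bordeaux", 24.0),
]

print([p.name for p in WINE_RULE.peers(anchor, catalog)])
```

Because the dataclass is frozen, changing the tolerance means creating a new rule with a new version string, which mirrors the public version bump described above.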
Frequently asked questions
Why ±20% price for the cohort and not a fixed dollar band?
Doesn't showing both percentiles confuse people?
How does this interact with the affiliate routing?
Where can I see the worked-example data?