Simpson's paradox in product reviews — when category aggregation flips the answer

By Ryan Siegal · Founder and Principal

Published 2026-04-27

The fallacy, in one paragraph

The classic textbook example is the 1973 UC Berkeley graduate-admissions case. Looking at all applicants, men were admitted at a higher rate than women. Looking department by department, most departments admitted women at a similar or higher rate than men; the aggregate gap arose because women applied disproportionately to the departments with the lowest admission rates. Both statements were true: the aggregated and disaggregated comparisons answered different questions. Edward Simpson described the phenomenon in 1951, and Colin Blyth gave it the name Simpson's paradox in 1972.

The conclusion will appear paradoxical only if we have a strong prior expectation that the conditional and marginal proportions ought to agree.
— Edward H. Simpson, "The interpretation of interaction in contingency tables," J. Royal Statistical Society 1951
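
The reversal is easy to reproduce with made-up counts. The sketch below uses two hypothetical departments (the figures are invented for illustration and are not the Berkeley data) in which women have the higher admission rate in each department yet the lower rate overall, because most women apply to the more selective department.

```python
# Hypothetical counts, NOT the real Berkeley figures: (applicants, admitted).
applications = {
    "Dept A (less selective)": {"men": (800, 480), "women": (100, 65)},
    "Dept B (more selective)": {"men": (200, 20), "women": (900, 108)},
}


def rate(applied, admitted):
    """Admission rate as a fraction between 0 and 1."""
    return admitted / applied


def aggregate_rate(group):
    """Admission rate for one group pooled across every department."""
    applied = sum(dept[group][0] for dept in applications.values())
    admitted = sum(dept[group][1] for dept in applications.values())
    return rate(applied, admitted)


for name, dept in applications.items():
    # Within each department, women are admitted at the higher rate.
    print(name, f"men {rate(*dept['men']):.1%}", f"women {rate(*dept['women']):.1%}")

# Pooled, the ordering flips: men 50.0%, women 17.3%.
print("All applicants", f"men {aggregate_rate('men'):.1%}", f"women {aggregate_rate('women'):.1%}")
```

Nothing about either department changes between the two views; only the weighting of the departments does.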

What it looks like in product reviews

Suppose we are ranking 2019 French white Burgundies. There are three reasonable comparison groups for a single bottle:

The narrow cohort: 2019 French white Burgundy at $20–$30, n = 147.
The broad cohort: all 2019 white Burgundy at any price, n = 612.
The full population: every wine in the database, n = 10,557.

A bottle can be 91st-percentile in the narrow cohort, 73rd in the broad cohort, and 62nd in the full population. All three numbers are true. They answer different questions: how does this wine compare to its direct price-tier peers, how does it compare to all 2019 white Burgundy, and how does it compare to every wine in the database. The answers diverge because the three populations have different score distributions. For the same score to rank 91st among price-tier peers but only 62nd in the full database, the peer set must score lower than the database as a whole, and leading a weaker peer set is a different signal from leading across all wine.
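
To make the mechanics concrete, here is a minimal sketch of ranking one score against different reference populations. The function name `percentile_rank`, the "share of scores at or below" convention, and the two tiny score lists are all assumptions chosen for illustration; they are not real cohort data.

```python
from bisect import bisect_right


def percentile_rank(value, reference):
    """Percentage of reference scores that are at or below `value`."""
    ordered = sorted(reference)
    return 100.0 * bisect_right(ordered, value) / len(ordered)


# Synthetic score lists, invented for illustration only.
narrow_cohort = [0.1, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.3]
full_population = [0.2, 0.5, 0.9, 1.0, 1.1, 1.2, 1.3, 1.5, 1.6, 1.8]

score = 1.14  # one score, ranked twice
print(percentile_rank(score, narrow_cohort))    # 90.0
print(percentile_rank(score, full_population))  # 50.0
```

The score never changes; only the list it is sorted into does.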

A worked example

Wine: 2019 Domaine Foo Meursault Premier Cru, $25
   R1 mean z-score:  +1.42 (35 reviewers)
   R1 90% CI-floor:  +1.42 − 1.645 · (1/√35)  =  +1.42 − 0.278  =  +1.14

Same CI-floor, three reference populations:

   Narrow cohort (2019 white Burgundy $20–$30, n = 147)
   → CI-floor +1.14 is at the 91st percentile of the cohort
   → Tagline: "Best-in-price-tier among 2019 white Burgundy"

   Broad cohort (all 2019 white Burgundy, n = 612)
   → CI-floor +1.14 is at the 73rd percentile
   → Tagline: "Strong but not exceptional within all 2019 white Burgundy"

   Full population (all wines, n = 10,557)
   → CI-floor +1.14 is at the 62nd percentile
   → Tagline: "Above average globally; price-class makes the difference"

Three percentiles, one CI-floor. Simpson's paradox would only kick in operationally if a methodology forced you to pick one of these as the score and discarded the others. That's what every other major review aggregator does today: a single percentile (or a single star average) reported without the comparison group it was computed against.
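
The CI-floor arithmetic in the worked example can be checked in a few lines. This is a minimal sketch that assumes, as the example's 1/√35 term implies, a unit standard deviation for reviewer z-scores; the function name `ci_floor` and its parameters are illustrative.

```python
import math


def ci_floor(mean_z, n_reviewers, z_crit=1.645):
    """Lower confidence bound on a mean z-score.

    Assumes a unit standard deviation, so the standard error is 1/sqrt(n).
    z_crit=1.645 is the critical value used in the worked example.
    """
    return mean_z - z_crit / math.sqrt(n_reviewers)


floor = ci_floor(1.42, 35)
print(round(floor, 2))  # 1.14
```

Because the penalty shrinks with the square root of the reviewer count, the same mean of +1.42 from only 9 reviewers would give a floor of about +0.87.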

What Rankquant does about it

Two structural defences:

Cohort percentiles are a re-ranking of the same CI-floor. A bottle's R1 global percentile and R1 cohort percentile come from the same underlying number (the lower bound of the two-sided 90% confidence interval on its z-score mean, equivalently a one-sided 95% bound) ranked against two different reference populations. They cannot disagree about the underlying quality; they can only disagree about which population we're comparing against. That single design choice eliminates a whole class of paradoxes where competing methodologies produce competing quality estimates.
We always show both percentiles. Every product page surfaces R1 global and R1 cohort side-by-side, with a tagline that interprets the gap. A wine that is global 62 / cohort 91 is labelled differently from a wine that is global 91 / cohort 62. Both numbers are visible to humans and to AI engines via the schema.org/Product additionalProperty array.
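
As a rough illustration of the machine-readable side, the sketch below builds a schema.org `Product` object whose `additionalProperty` array carries both percentiles as `PropertyValue` entries. The helper name `product_jsonld` and the property names are hypothetical; the actual markup on a product page may differ.

```python
import json


def product_jsonld(name, global_pct, cohort_pct, cohort_label):
    """Build a schema.org Product dict exposing both percentiles.

    The property names below are placeholders, not a published vocabulary.
    """
    return {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "additionalProperty": [
            {
                "@type": "PropertyValue",
                "name": "R1 global percentile",
                "value": global_pct,
            },
            {
                "@type": "PropertyValue",
                "name": "R1 cohort percentile",
                "value": cohort_pct,
                "description": cohort_label,  # which peer set was used
            },
        ],
    }


doc = product_jsonld(
    "2019 Domaine Foo Meursault Premier Cru", 62, 91,
    "2019 white Burgundy, $20-$30",
)
print(json.dumps(doc, indent=2, ensure_ascii=False))
```

Carrying the cohort label next to the cohort percentile keeps the number attached to the comparison group it was computed against.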

Cohort and global percentiles answer different questions. We publish both, label both.
Cohort percentile	Same CI-floor re-ranked among same-category, ±20%-price peers. Answers: "How good is this product among its direct alternatives?"
Global percentile	Same CI-floor re-ranked among the full database. Answers: "How good is this product compared to the full universe of products we cover?"
When the two diverge	Cohort > global by ≥15 percentile points: "best-in-price-class". Global > cohort by ≥15: "strong overall but ordinary within its category". The tagline engine surfaces this gap automatically.
When the two agree	Cohort and global within 10 points of each other: the product is roughly the same rank in its peer set as in the full database. We surface this as "ranks consistently across views".
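
The thresholds in the table translate into a small decision rule. The sketch below is one possible reading of it, with a hypothetical function name; the table does not say what happens when the gap is between 10 and 15 points, so this version returns no tagline in that range, which is an assumption.

```python
def gap_tagline(cohort_pct, global_pct):
    """Pick a tagline from the gap between cohort and global percentile."""
    gap = cohort_pct - global_pct
    if gap >= 15:
        return "best-in-price-class"
    if gap <= -15:
        return "strong overall but ordinary within its category"
    if abs(gap) <= 10:
        return "ranks consistently across views"
    # Gaps strictly between 10 and 15 points are not covered by the table.
    return None


print(gap_tagline(cohort_pct=91, global_pct=62))  # best-in-price-class
print(gap_tagline(cohort_pct=62, global_pct=91))  # strong overall but ordinary within its category
print(gap_tagline(cohort_pct=70, global_pct=65))  # ranks consistently across views
```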


The reverse failure mode

Most reviews of the paradox warn about category-level aggregation hiding sub-group effects. The reverse is also a real failure mode: over-narrow cohorting can elevate a mediocre product to #1 in a peer set so small the percentile is statistically meaningless. A wine that is global 50 / cohort 100 is suspicious if the cohort has only n = 8 members, because the "100" is more about cohort sparsity than about the wine.

Rankquant's defence here is the cohort-size threshold: a cohort percentile is suppressed and labelled "cohort too thin to publish" if the cohort has fewer than 30 members. The number 30 isn't arbitrary: it's where the empirical CDF used to convert CI-floors to percentiles becomes statistically stable. With 30 members a single rank step moves the percentile by about 3.3 points, against 12.5 points in an n = 8 cohort. Below the threshold we don't pretend to know the cohort distribution.
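
A minimal sketch of the suppression rule, assuming the percentile is the share of cohort CI-floors at or below the product's own. The function and constant names are hypothetical; only the threshold of 30 and the label text come from the description above.

```python
from bisect import bisect_right

MIN_COHORT_SIZE = 30
SUPPRESSED_LABEL = "cohort too thin to publish"


def cohort_percentile_or_label(ci_floor, cohort_floors, min_size=MIN_COHORT_SIZE):
    """Return the cohort percentile, or the suppression label for thin cohorts."""
    if len(cohort_floors) < min_size:
        return SUPPRESSED_LABEL
    ordered = sorted(cohort_floors)
    return 100.0 * bisect_right(ordered, ci_floor) / len(ordered)


thin_cohort = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]  # n = 8
healthy_cohort = [i / 10 for i in range(40)]             # n = 40

print(cohort_percentile_or_label(0.9, thin_cohort))      # cohort too thin to publish
print(cohort_percentile_or_label(1.95, healthy_cohort))  # 50.0
```

Without the size check the first call would report 100.0, which is exactly the misleading "cohort 100" described above.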

Where editorial judgment still matters

Cohort definitions themselves are an editorial choice. We define the wine cohort as same-category × ±20% price; we define the movie cohort as same-decade × same-genre; and so on. Different cohort definitions would produce different percentiles, and a methodology operator who wanted to elevate a particular product could in principle narrow its cohort until it ranks well. We defend against this in two ways:

The cohort definition for each category is published once, on the methodology page, and version-bumped publicly when it changes.
The cohort definition is the same for every product in a category — a single product cannot be assigned a custom cohort.

The version-bump rule is a pre-commitment: if we change the wine cohort from ±20% price to ±15%, every wine's cohort percentile updates simultaneously, with a dated note in the changelog. We can't cherry-pick.
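
One way to picture the pre-commitment is a single immutable, versioned definition per category that every product passes through. The sketch below is an illustration under assumed names (`CohortDefinition`, `members`, the version strings and the toy catalog are all hypothetical); whether the product itself counts as a member of its own cohort is not specified, and this version includes it if it appears in the catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a published definition cannot be mutated in place
class CohortDefinition:
    category: str
    price_band: float  # 0.20 means ±20% of the product's price
    version: str

    def members(self, price, catalog):
        """Catalog items in this category priced within the relative band."""
        low = price * (1 - self.price_band)
        high = price * (1 + self.price_band)
        return [
            item for item in catalog
            if item["category"] == self.category and low <= item["price"] <= high
        ]


catalog = [  # toy data, invented for illustration
    {"name": "A", "category": "2019 white Burgundy", "price": 22.0},
    {"name": "B", "category": "2019 white Burgundy", "price": 28.0},
    {"name": "C", "category": "2019 white Burgundy", "price": 29.5},
    {"name": "D", "category": "2019 white Burgundy", "price": 35.0},
    {"name": "E", "category": "2019 red Burgundy", "price": 25.0},
]

wine_v1 = CohortDefinition("2019 white Burgundy", 0.20, "v1")
wine_v2 = CohortDefinition("2019 white Burgundy", 0.15, "v2")  # a public version bump

print([item["name"] for item in wine_v1.members(25.0, catalog)])  # ['A', 'B', 'C']
print([item["name"] for item in wine_v2.members(25.0, catalog)])  # ['A', 'B']
```

Because the band lives in one shared object rather than on individual products, tightening it changes every cohort in the category at once.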

Frequently asked questions

Why ±20% price for the cohort and not a fixed dollar band?

Multiplicative bands generalise across categories. A $200-per-night hotel and a $20,000-per-night hotel both deserve a comparison group of "similarly-priced peers", and the relative band gives them one without us hand-tuning per category. The 20% number itself is a published constant; we ran sensitivity analyses at 10%, 20%, 30%, and 50%, and 20% produced the most stable cohort sizes across categories.

Doesn't showing both percentiles confuse people?

Less than the alternative. The alternative is showing one percentile and pretending it answers every question, which is exactly the failure mode that creates Simpson-style paradoxes. Our user-testing showed that readers correctly interpreted "global 62, cohort 91" as "strong within its price class, ordinary across all wine" without a tutorial. The gap is itself useful information.

How does this interact with the affiliate routing?

Affiliate routing is independent of the score. The "Buy" link points to whatever retailer maximises (lowest observed price × highest commission), regardless of cohort or global percentile. A product that's global 50 / cohort 95 gets the same routing logic as a product that's global 95 / cohort 50.

Where can I see the worked-example data?

The numbers above are illustrative — Domaine Foo isn't a real wine. Once real Vivino + Wine Spectator + Robert Parker data lands, every /reviews/<slug>/ page will show both percentiles with the actual cohort label and reviewer count. Until then the demo data on each review page is randomly seeded and explicitly noindex'd.
