Methodology
By Ryan Siegal · Founder and Principal
Who is doing this, and why the math
Rankquant is a team of statistically-trained editors. We are not a review site that hired a data scientist; we are applied statisticians who believe the way review scores are published today is straightforwardly broken. Our job on this page is to explain, in plain English paired with the exact equations, how we turn messy real-world reviews into a number you can actually trust.
None of this math is novel. Per-reviewer normalization, reviewer fixed-effects aggregation, and confidence-interval-based ranking are textbook techniques in psychometrics, meta-analysis, and sports analytics — see /theory/founding-metrics for a first-principles tour of the five primitives this pipeline is built from. What's different is publishing them, applied rigorously, as the basis of a consumer review site.
The problem every review site has
Every major review surface suffers from severe score inflation: ratings pile up near the top of the scale (a left-skewed distribution), so raw averages cannot distinguish a genuinely exceptional product from a merely average one.
- Amazon: long-run average across billions of reviews in all categories (source: Marketplace Pulse review analyses).
- Yelp: long-run category-averaged restaurant rating (source: Yelp transparency reports).
- Booking.com: long-run hotel rating; hotels below 8.0 are flagged as low-rated (source: Booking.com scoring bands).
When everything is "excellent," the word carries no information. And there's a second problem hiding underneath the inflation: different reviewers use the same numbers to mean different things. One reviewer's 88 is another reviewer's ceiling; one's 3-stars is another's "I liked it." Averaging raw numbers across reviewers mixes those private scales together and throws away signal.
Our fix has two moves. First, normalize every reviewer onto their own z-scale so personal grading habits wash out. Second, rank on the lower bound of a confidence interval so the number of reviews actually matters.
The method, in four steps
Step 1 — per-reviewer z-score normalization
For each reviewer u, compute the reviewer's personal mean μ_u and standard deviation σ_u over all of their reviews across all products in our database. Their rating of product i is then converted to a z-score:
z_{u,i} = ( r_{u,i} − μ_u ) / σ_u
where
r_{u,i} = reviewer u's raw rating of product i
μ_u = reviewer u's personal mean rating across all their reviews
σ_u = reviewer u's personal standard deviation (Bessel-corrected, df = n_u − 1)
n_u = number of reviews reviewer u has written in our dataset
Two reviewers who disagree on scale but agree on quality produce identical z-scores. A reviewer who always rates 90–96 and a reviewer who always rates 75–88 will both produce z ≈ +1.3 for their personal favorites, because that's where each reviewer sits relative to their own distribution. The z-score is dimensionless. It's the unit reviewers actually share.
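A minimal Python sketch of the normalization, assuming ratings arrive as (reviewer, product, raw rating) tuples. The function name, the input shape, and the sample reviewers are illustrative assumptions, not the published reference code. Reviewers with fewer than two reviews or zero spread are skipped here, in line with the qualifying conditions described in this section.

```python
from collections import defaultdict
from statistics import mean, stdev


def reviewer_z_scores(ratings):
    """Map (reviewer, product) -> z-score on that reviewer's own scale.

    `ratings` is an iterable of (reviewer, product, raw_rating) tuples
    (a hypothetical input shape chosen for this sketch).
    """
    by_reviewer = defaultdict(list)
    for reviewer, product, raw in ratings:
        by_reviewer[reviewer].append((product, raw))

    z = {}
    for reviewer, rows in by_reviewer.items():
        raws = [raw for _, raw in rows]
        if len(raws) < 2:
            continue  # sigma_u is undefined with a single review
        mu = mean(raws)
        sigma = stdev(raws)  # Bessel-corrected, df = n_u - 1
        if sigma == 0:
            continue  # no spread, so no standard z-score
        for product, raw in rows:
            z[(reviewer, product)] = (raw - mu) / sigma
    return z


# Made-up data: two reviewers with different scales, same ordering.
ratings = [
    ("high_grader", "A", 90.0), ("high_grader", "B", 93.0), ("high_grader", "C", 96.0),
    ("wide_grader", "A", 75.0), ("wide_grader", "B", 81.5), ("wide_grader", "C", 88.0),
    ("one_off", "A", 99.0),  # single review, skipped
]
z = reviewer_z_scores(ratings)
```

In this toy data both reviewers land on the same z-score for product C even though their raw numbers differ by eight points.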
Who counts as a "qualifying reviewer"?
Not every reviewer's opinions belong in the z-scale. Two conditions:
- n_u ≥ 2. A single review gives no personal distribution: μ_u is just that one rating and σ_u is undefined (df = 0).
- σ_u > 0. A reviewer who has given every product the same rating carries no signal about relative quality. Including them amounts to dividing by zero.
A reviewer with n_u ≥ 2 but σ_u = 0 isn't useless; they just can't be normalized the standard way. We keep them on file and include them in the R3 broadened lens (Step 2), where we impute a plausible σ from their source's pooled dispersion.
Cross-category pooling. A reviewer who writes both wine and bourbon reviews goes into one reviewer pool with one μ_u and one σ_u computed across everything they've rated. That's deliberate: a person's personal scale is a property of the person, not the product category. Pooling keeps n_u large and σ_u stable.
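The qualifying rules can be sketched as a small classifier over a reviewer's pooled ratings. The function name and the three labels are hypothetical, chosen only for illustration; the input is assumed to be every raw rating the reviewer has given, across all categories.

```python
from statistics import stdev


def classify_reviewer(pooled_ratings):
    """Classify a reviewer from all of their raw ratings, pooled across categories.

    Returns one of three illustrative labels:
      "excluded"   -> n_u < 2, no personal distribution
      "r3_only"    -> n_u >= 2 but sigma_u == 0, usable only in the broadened lens
      "qualifying" -> n_u >= 2 and sigma_u > 0, usable in every lens
    """
    if len(pooled_ratings) < 2:
        return "excluded"
    if stdev(pooled_ratings) == 0:
        return "r3_only"
    return "qualifying"


# A reviewer with two wine ratings and one bourbon rating is pooled into one list.
wine, bourbon = [88, 92], [85]
label = classify_reviewer(wine + bourbon)
```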
Step 2 — three aggregation lenses
After Step 1, every product has a list of z-scores — one per qualifying reviewer. We combine them three ways because each lens answers a different question.
| Lens | What it computes and the question it asks |
|---|---|
| R1: Pure relative (headline) | Unweighted mean of reviewer z-scores. Every qualifying reviewer counts equally. Treats an amateur reviewer with 5 reviews exactly like a professional with 5,000. Asks: "What does the crowd of people who rated this think, relative to their own scales?" |
| R2: Source-weighted | Weighted mean with weights published per source. Reviewers from sources we consider more credible (e.g. Wine Advocate, NYT Book Review, Michelin) count more. Asks: "What does the crowd think if we listen more to the most rigorous reviewers?" |
| R3: Broadened | Same as R1 but includes reviewers with n_u ≥ 2 and σ_u = 0 (we impute σ). Wider net; more signal when a product has few variance-producing reviewers. Asks: "What happens if we add in the consistent reviewers we normally have to drop?" |
The three formulas:
R1(i) = (1 / N_i) · Σ_{u ∈ Q_i} z_{u,i} ← equal weight per reviewer
R2(i) = ( Σ_{u ∈ Q_i} w_s · z_{u,i} ) / ( Σ_{u ∈ Q_i} w_s ) ← source-weighted
R3(i) = (1 / N'_i) · Σ_{u ∈ Q'_i} z̃_{u,i} ← includes σ_u = 0 reviewers
where
Q_i = qualifying reviewers of product i (n_u ≥ 2, σ_u > 0)
Q'_i = Q_i ∪ {reviewers of product i with n_u ≥ 2, σ_u = 0}
N_i = |Q_i|, N'_i = |Q'_i|
w_s = published credibility weight of reviewer u's source s
z̃_{u,i} = (r_{u,i} − μ_u) / σ̃_s, where σ̃_s is the pooled SD of source s
(used only when a reviewer's own σ_u = 0; otherwise z̃_{u,i} = z_{u,i})
R1 is primary. It's the number that headlines every product card. R2 and R3 sit beside it as context; large gaps between them are diagnostic and drive our taglines (see "Reading the spread" below).
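The three lenses can be sketched in a few lines of Python. The function names and sample numbers are assumptions made for illustration; the inputs are the z-scores for one product and, for R2, the source weight attached to each reviewer.

```python
def r1(z_scores):
    """Unweighted mean of the qualifying reviewers' z-scores."""
    return sum(z_scores) / len(z_scores)


def r2(z_scores, weights):
    """Source-weighted mean; weights[k] is the source weight for z_scores[k]."""
    return sum(w * z for w, z in zip(weights, z_scores)) / sum(weights)


def z_tilde(raw, mu, source_pooled_sd):
    """Imputed z-score for a reviewer whose own sigma_u is 0."""
    return (raw - mu) / source_pooled_sd


def r3(z_scores, imputed_z_scores):
    """Unweighted mean over qualifying reviewers plus sigma_u = 0 reviewers."""
    combined = list(z_scores) + list(imputed_z_scores)
    return sum(combined) / len(combined)


# Made-up product: three qualifying reviewers, one reviewer who always rates 4.
z_scores = [1.2, 0.8, 1.0]
weights = [3.0, 1.0, 1.0]  # hypothetical source weights
flat = z_tilde(raw=4.0, mu=4.0, source_pooled_sd=0.9)

r1_value = r1(z_scores)
r2_value = r2(z_scores, weights)
r3_value = r3(z_scores, [flat])
```

One consequence of the definitions as written: a reviewer with σ_u = 0 has every rating equal to μ_u, so their imputed z̃ evaluates to 0 whatever σ̃_s is. In this sketch such a reviewer acts as a neutral vote that pulls R3 toward zero relative to R1.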
Source weights (R2 only)
Source weights are published per category. They do not affect R1 or R3. They represent our editorial judgment about which sources have historically produced the most decision-useful reviews — trained professional tasting panels, rigorous editorial gatekeeping, documented manipulation resistance. Weights live in our open-source repo and any change requires a public version bump with changelog. We do not accept payment to adjust source weights.
Step 3 — the 90% CI-floor
This is the step that matters most for what you see on a product page. A product with 4 reviewers averaging z = +2.1 should not outrank a product with 80 reviewers averaging z = +1.6. The thin-sample product might be exceptional, or its mean might be noise. The thick-sample product has paid its statistical dues.
So instead of ranking on the mean, we rank on the lower bound of the 90% confidence interval around the mean. That floor answers the question: "given this sample size, what's a defensibly pessimistic estimate of the product's true quality?"
For each aggregate (R1, R2, R3):
Ẑ = the aggregated mean z-score (R1, R2, or R3)
SE(Ẑ) = 1 / √N_eff (standard error of the mean, assuming unit-variance z-scores)
floor = Ẑ − 1.645 · SE(Ẑ) ← lower bound of the two-sided 90% CI (equivalently, a one-sided 95% bound)
N_eff = effective sample size
= N (for R1 and R3)
= ( Σ w_s )² / Σ w_s² (for R2; Kish effective size)
Worked example, two real-world shapes:
| Product | N (reviewers) | Mean Ẑ | SE(Ẑ) | CI-floor (90%) | Rank by |
|---|---|---|---|---|---|
| Thin-sample darling | 4 | +2.10 | 0.500 | +1.28 | Mean = 2.10, Floor = 1.28 |
| Well-reviewed consensus | 80 | +1.60 | 0.112 | +1.42 | Mean = 1.60, Floor = 1.42 |
On raw mean, the darling wins +2.10 to +1.60. On CI-floor — how we actually rank — the consensus product wins +1.42 to +1.28. This is by design. It's the same intuition Bayesian sports ratings, IMDb's Top 250 formula, and Wilson score confidence intervals (used by Reddit and Yelp internally) all use: penalize uncertainty, reward consistency.
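The worked example can be reproduced with a short Python sketch. The function and constant names are illustrative assumptions; the arithmetic follows the SE and floor formulas from this step, with the 1.645 multiplier taken from the formula above.

```python
import math

Z_MULTIPLIER = 1.645  # multiplier used in the floor formula


def kish_n_eff(weights):
    """Kish effective sample size: (sum of w)^2 / (sum of w^2)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


def ci_floor(mean_z, n_eff):
    """Lower bound: mean minus the multiplier times SE, with SE = 1 / sqrt(N_eff)."""
    return mean_z - Z_MULTIPLIER / math.sqrt(n_eff)


darling = ci_floor(2.10, 4)      # roughly +1.28
consensus = ci_floor(1.60, 80)   # roughly +1.42

# For R2 the reviewer count is replaced by the Kish effective size.
# With equal weights it reduces to the plain count.
equal = kish_n_eff([1.0] * 5)           # 5.0
uneven = kish_n_eff([3.0, 1.0, 1.0])    # 25 / 11, fewer than 3 effective reviewers
```

Uneven weights always shrink N_eff below the raw reviewer count, which widens the R2 standard error relative to R1 for the same product.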
Step 4 — convert the CI-floor to a 0–100 percentile
CI-floors are nice for math but not meaningful to a reader. So we rank every product's CI-floor against every other product's CI-floor in our database and express the result as a percentile using the empirical CDF:
p_R1_global(i) = 100 · rank( floor_R1(i) ) / N_total
similarly for p_R2_global, p_R3_global
rank() = ordinal rank, ties split at midpoint
N_total = number of products with a valid R1 CI-floor
A percentile of 90 means the product's R1 CI-floor is higher than roughly 90% of the R1 CI-floors in the database. 50 is the median. 10 means the product's floor sits in the bottom 10% of the database.
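A sketch of the percentile step in Python. It reads "ties split at midpoint" as the average of the ordinal ranks a tied group would occupy, which is an assumption about the intended tie rule; the function name and sample floors are also illustrative.

```python
from bisect import bisect_left, bisect_right


def global_percentiles(floors):
    """Map product -> 100 * midrank / N_total for a {product: ci_floor} dict."""
    ordered = sorted(floors.values())
    n_total = len(ordered)
    result = {}
    for product, value in floors.items():
        below = bisect_left(ordered, value)         # floors strictly lower
        at_or_below = bisect_right(ordered, value)  # floors lower or tied
        # Tied products occupy ordinal ranks below+1 .. at_or_below;
        # each receives the midpoint of that range.
        midrank = (below + 1 + at_or_below) / 2
        result[product] = 100 * midrank / n_total
    return result


# Made-up floors with one tie between b and c.
pct = global_percentiles({"a": -0.5, "b": 0.2, "c": 0.2, "d": 1.1})
# a -> 25.0, b -> 62.5, c -> 62.5, d -> 100.0
```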
Cohort percentiles — a re-ranking, not a re-computation
Global percentile answers "how does this product compare to everything we measure?" — which is useful but sometimes unfair. A $14 bottle that beats all other $14 bottles is doing exactly what a $14 bottle should do; showing it at the 25th global percentile (against $200 Burgundy) hides that achievement.
So we also publish a cohort percentile. It's the same CI-floor, re-ranked within a narrower peer group.
Cohort(i) = { j : category(j) = category(i)
AND |price(j) − price(i)| / price(i) ≤ 0.20 }
p_R1_cohort(i) = 100 · rank(floor_R1(i)) / |Cohort(i)|
within Cohort(i)
No new math. Cohort uses the same R1 CI-floor computed globally; it's just ranked against fewer competitors. Same for R2 and R3. This keeps the pipeline fast, the storage cheap, and the output auditable. You can verify our cohort score by (a) looking up the product's raw CI-floor, (b) listing the cohort members, and (c) computing the rank yourself.
Category is the coarsest grouping ("wine," "bourbon," "single-origin coffee," etc.) and price is the list price at the time of most recent review. The ±20% band is symmetric: a $100 bottle's cohort is $80–$120.
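The audit described above (look up the floor, list the cohort, compute the rank) can be sketched in Python. The record layout, function names, and sample products are assumptions for illustration; the tie rule is again taken to be the average ordinal rank.

```python
from bisect import bisect_left, bisect_right

PRICE_BAND = 0.20  # the ±20% band around the target's price


def cohort(products, target):
    """Names of products in the target's category and within ±20% of its price."""
    t = products[target]
    return [
        name for name, p in products.items()
        if p["category"] == t["category"]
        and abs(p["price"] - t["price"]) / t["price"] <= PRICE_BAND
    ]


def cohort_percentile(products, target):
    """Re-rank the target's existing CI-floor within its cohort only."""
    floors = sorted(products[name]["floor"] for name in cohort(products, target))
    value = products[target]["floor"]
    midrank = (bisect_left(floors, value) + 1 + bisect_right(floors, value)) / 2
    return 100 * midrank / len(floors)


# Made-up catalogue: the floors are reused as-is, only the peer group changes.
products = {
    "wine_12": {"category": "wine", "price": 12.0, "floor": 0.1},
    "wine_14": {"category": "wine", "price": 14.0, "floor": 0.4},
    "wine_16": {"category": "wine", "price": 16.0, "floor": 0.2},
    "burgundy_200": {"category": "wine", "price": 200.0, "floor": 1.5},
    "bourbon_14": {"category": "bourbon", "price": 14.0, "floor": 0.9},
}
members = cohort(products, "wine_14")
score = cohort_percentile(products, "wine_14")
```

Because the band is measured relative to the target's own price, membership is not necessarily mutual: product j can fall inside product i's band while i falls outside j's.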
Reading the spread — where taglines come from
R1, R2, and R3 are three views of the same product. When they agree, the product is simple to describe. When they diverge, the divergence itself is the story, and that's what our on-card taglines express.
| Spread pattern | What it means | Tagline style |
|---|---|---|
| R1 high, R1-cohort much higher | Exceptional relative to its price peers; less dominant globally. | "Best in its price class; more moderate globally." |
| R1 high, R1-cohort lower | Strong overall but priced into a tough cohort. | "A strong global performer with fierce cohort competition." |
| R1 high, R2 low | Crowd-beloved; professional critics are cooler. | "Crowd-beloved; professional critics rate this lower." |
| R1 low, R2 much higher | Critics rate this far above the crowd consensus. | "Professional critics rate this far above the crowd consensus." |
| R3 > R1 | Low-variance reviewers push it up — suggests broad consistency. | "Strong consensus even among less-discerning reviewers." |
| All three cluster tightly above 75 | Universally well-regarded. | "Exceptional by every lens we apply." |
The full tagline decision tree lives in lib/taglines.ts in the repository. Anyone can read it. No hidden editorial hand.
Where degrees of freedom enter
The Bessel-corrected n_u − 1 in σ_u is the degrees of freedom. A reviewer with n_u = 2 has df = 1, and their σ_u is very noisy, but they still enter the z-score at full weight. The CI-floor step is where this is properly penalized: reviewers with thin personal distributions produce noisier z, which increases the per-product variance, which widens the product's SE, which lowers its CI-floor. Thinness is penalized at the product level, not the reviewer level.
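A small Python check makes the df = 1 case concrete. With exactly two distinct ratings and a Bessel-corrected σ_u, the two z-scores always come out as −1/√2 and +1/√2 (about ∓0.707), regardless of how far apart the raw ratings are. The helper name is hypothetical.

```python
from statistics import mean, stdev


def z_pair(first, second):
    """z-scores for a reviewer who has written exactly two (distinct) ratings."""
    mu = mean([first, second])
    sigma = stdev([first, second])  # Bessel-corrected, df = 1
    return (first - mu) / sigma, (second - mu) / sigma


close_call = z_pair(88, 92)   # about (-0.707, +0.707)
wide_gap = z_pair(50, 99)     # also about (-0.707, +0.707)
```

A two-review reviewer therefore only tells the pipeline which of the two products they preferred, not by how much.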
The full treatment, including N_eff derivations for R2's weighted case, is at /theory/degrees-of-freedom/. The CI-floor derivation and the choice of 90% (vs. 95% or 99%) are at /theory/confidence-intervals/.
Affiliate routing (fully disclosed)
Our "Buy" link for any product is selected by a published routing formula combining lowest observed price with highest affiliate commission across our authorized retailers. Every retailer price is shown openly on the page.
For each retailer r in authorized_retailers(product):
score(r) = price_attractiveness(r) · commission_rate(r)
price_attractiveness(r) = 1 − (price_r − min_price) / min_price
Primary Buy link = argmax_r score(r)
Secondary links = top-3 by raw price
Affiliate commissions never influence the normalized score. They affect only which retailer we feature for "Buy." The score is deterministic given reviewer ratings, source weights, and published constants; you can fork the repository, plug in different weights, and reproduce or contest every number we publish.
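The routing formula can be sketched in Python as follows. The function name, the offer layout, and the sample retailers are assumptions; "top-3 by raw price" is read here as the three lowest prices, which is also an assumption.

```python
def pick_buy_links(offers):
    """Choose links from {retailer: (price, commission_rate)}.

    Returns (primary, secondary), where primary maximizes
    price_attractiveness * commission_rate and secondary lists up to three
    retailers in ascending price order.
    """
    min_price = min(price for price, _ in offers.values())

    def score(retailer):
        price, commission = offers[retailer]
        attractiveness = 1 - (price - min_price) / min_price
        return attractiveness * commission

    primary = max(offers, key=score)
    secondary = sorted(offers, key=lambda r: offers[r][0])[:3]
    return primary, secondary


# Made-up offers for one product.
offers = {
    "A": (50.0, 0.04),
    "B": (52.0, 0.08),
    "C": (60.0, 0.10),
    "D": (75.0, 0.12),
}
primary, secondary = pick_buy_links(offers)
```

In this toy data the primary link goes to retailer C rather than the cheapest retailer A, because C's higher commission outweighs its 20% price penalty. Note also that price_attractiveness turns negative once a retailer's price exceeds twice the minimum, so such a retailer can never win the primary slot while any retailer scores above zero.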
Reproducibility and open source
If you disagree with our output, you can check our work. Every step is published. The code is open. Fork it, improve it, cite it.
The reference normalization is published at https://github.com/rankquant, MIT-licensed, pip-installable. Every product page shows its per-product intermediates (R1, R2, R3, CI-floors, reviewer count, effective sample size) in full so you can audit any individual score without running the code.