Methodology

By Ryan Siegal · Founder and Principal

Published 2026-04-23 · Last updated 2026-04-27

Who is doing this, and why the math

Rankquant is a team of statistically trained editors. We are not a review site that hired a data scientist; we are applied statisticians who believe the way review scores are published today is straightforwardly broken. Our job on this page is to explain, in plain English paired with the exact equations, how we turn messy real-world reviews into a number you can actually trust.

None of this math is novel. Per-reviewer normalization, reviewer fixed-effects aggregation, and confidence-interval-based ranking are textbook techniques in psychometrics, meta-analysis, and sports analytics — see /theory/founding-metrics for a first-principles tour of the five primitives this pipeline is built from. What's different is publishing them, applied rigorously, as the basis of a consumer review site.

The problem every review site has

Every major review surface suffers severe score inflation: ratings pile up near the top of the scale, so raw averages cannot distinguish a genuinely exceptional product from a merely average one.

4.4/5 avg

Amazon's long-run average across billions of reviews across all categories.

Marketplace Pulse review analyses

4.2/5 avg

Yelp's long-run category-averaged restaurant rating.

Yelp transparency reports

8.4/10 avg

Booking.com's long-run hotel rating; hotels below 8.0 are flagged as low-rated.

Booking.com scoring bands

When everything is "excellent," the word carries no information. And there's a second problem hiding underneath the inflation: different reviewers use the same numbers to mean different things. One reviewer's 88 is another reviewer's ceiling; one's 3-stars is another's "I liked it." Averaging raw numbers across reviewers mixes those private scales together and throws away signal.

Our fix has two moves. First, normalize every reviewer onto their own z-scale so personal grading habits wash out. Second, rank on the lower bound of a confidence interval so the number of reviews actually matters.

The method, in four steps

Step 1 — per-reviewer z-score normalization

For each reviewer u, compute the reviewer's personal mean μ_u and standard deviation σ_u over all of their reviews across all products in our database. Their rating of product i is then converted to a z-score:

z_{u,i}  =  ( r_{u,i}  −  μ_u )  /  σ_u

  where
    r_{u,i} = reviewer u's raw rating of product i
    μ_u     = reviewer u's personal mean rating across all their reviews
    σ_u     = reviewer u's personal standard deviation (Bessel-corrected, df = n_u − 1)
    n_u     = number of reviews reviewer u has written in our dataset

Two reviewers who disagree on scale but agree on quality produce identical z-scores. A reviewer who always rates 90–96 and a reviewer who always rates 75–88 will both produce z ≈ +1.3 for their personal favorites, because that's where each reviewer sits relative to their own distribution. The z-score is dimensionless. It's the unit reviewers actually share.
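As a minimal sketch of Step 1 (not the reference implementation; the function name and the two rating lists are made up for illustration), the following uses Python's standard library to show that two reviewers whose scales differ only by a shift and a stretch land on the same z-scores:

```python
from statistics import mean, stdev

def z_scores(ratings):
    """Convert one reviewer's raw ratings to z-scores on their own scale."""
    mu_u = mean(ratings)
    sigma_u = stdev(ratings)  # sample SD, Bessel-corrected (df = n_u - 1)
    return [(r - mu_u) / sigma_u for r in ratings]

# Hypothetical reviewers: same ordering of four products, different scales.
generous = [90, 92, 94, 96]   # always rates in the 90s
strict = [76, 80, 84, 88]     # same opinions, lower and wider scale

z_generous = z_scores(generous)
z_strict = z_scores(strict)

for a, b in zip(z_generous, z_strict):
    print(f"{a:+.3f}  {b:+.3f}")
```

Each printed row shows the same value in both columns, because the z-score removes both the reviewer's personal mean and their personal spread.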

Who counts as a "qualifying reviewer"?

Not every reviewer's opinions belong in the z-scale. Two conditions:

n_u ≥ 2. One review per reviewer contains no personal distribution — μ_u and σ_u are undefined.
σ_u > 0. A reviewer who has given every product the same rating carries no signal about relative quality. Including them amounts to dividing by zero.

A reviewer with n_u ≥ 2 but σ_u = 0 isn't useless; they just can't be normalized the standard way. We keep them on file and include them in the R3 broadened lens (Step 2), where we impute a plausible σ from their source's pooled dispersion.

Cross-category pooling. A reviewer who writes both wine and bourbon reviews goes into one reviewer pool with one μ_u and one σ_u computed across everything they've rated. That's deliberate: a person's personal scale is a property of the person, not the product category. Pooling keeps n_u large and σ_u stable.
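The qualification rules and cross-category pooling can be sketched together. This is an illustration under assumptions: the function name, the dictionary layout, and the status labels ("qualifying", "r3_only", "excluded") are hypothetical and are not taken from the repository.

```python
from statistics import mean, stdev

def reviewer_profile(ratings_by_category):
    """Pool one reviewer's ratings across categories and classify the reviewer."""
    # One pool per person: categories are flattened before computing mu_u, sigma_u.
    pooled = [r for ratings in ratings_by_category.values() for r in ratings]
    n_u = len(pooled)
    if n_u < 2:
        # With a single review, mu_u and sigma_u are undefined.
        return {"n_u": n_u, "mu_u": None, "sigma_u": None, "status": "excluded"}
    mu_u = mean(pooled)
    sigma_u = stdev(pooled)  # Bessel-corrected
    # sigma_u == 0 means a constant rater: kept on file for the R3 lens only.
    status = "qualifying" if sigma_u > 0 else "r3_only"
    return {"n_u": n_u, "mu_u": mu_u, "sigma_u": sigma_u, "status": status}

cross_category = reviewer_profile({"wine": [88, 91], "bourbon": [85, 93, 90]})
constant_rater = reviewer_profile({"wine": [4, 4, 4]})
one_review = reviewer_profile({"coffee": [5]})

for profile in (cross_category, constant_rater, one_review):
    print(profile["n_u"], profile["status"])
```

The cross-category reviewer gets a single μ_u and σ_u estimated from five ratings rather than two separate estimates from two and three ratings.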

Step 2 — three aggregation lenses

After Step 1, every product has a list of z-scores — one per qualifying reviewer. We combine them three ways because each lens answers a different question.

Three aggregation lenses. R1 is the headline number; R2 and R3 give interpretive context.
R1 — Pure relative (headline)	Unweighted mean of reviewer z-scores. Every qualifying reviewer counts equally. Treats an amateur reviewer with 5 reviews exactly like a professional with 5,000. Asks: "What does the crowd of people who rated this think, relative to their own scales?"
R2 — Source-weighted	Weighted mean with weights published per source. Reviewers from sources we consider more credible (e.g. Wine Advocate, NYT Book Review, Michelin) count more. Asks: "What does the crowd think if we listen more to the most rigorous reviewers?"
R3 — Broadened	Same as R1 but includes reviewers with n_u ≥ 2 and σ_u = 0 (we impute σ). Wider net; more signal when a product has few variance-producing reviewers. Asks: "What happens if we add in the consistent reviewers we normally have to drop?"

The three formulas:

R1(i)  =  (1 / N_i)   ·  Σ_{u ∈ Q_i}  z_{u,i}           ← equal weight per reviewer

R2(i)  =  ( Σ_{u ∈ Q_i} w_s · z_{u,i} )  /  ( Σ_{u ∈ Q_i} w_s )     ← source-weighted

R3(i)  =  (1 / N'_i)  ·  Σ_{u ∈ Q'_i}  z̃_{u,i}         ← includes σ_u = 0 reviewers

  Q_i  = qualifying reviewers of product i  (n_u ≥ 2, σ_u > 0)
  Q'_i = Q_i  ∪  {reviewers of product i with n_u ≥ 2, σ_u = 0}
  N_i  = |Q_i|,  N'_i = |Q'_i|
  w_s  = published credibility weight of reviewer u's source s
  z̃_{u,i} = z_{u,i} when σ_u > 0; otherwise (r_{u,i} − μ_u) / σ̃_s,
           where σ̃_s is the pooled SD of source s
           (used only when a reviewer's own σ_u = 0)

R1 is primary. It's the number that headlines every product card. R2 and R3 sit beside it as context; large gaps between them are diagnostic and drive our taglines (see Step 4 below).
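The three lenses can be sketched in a few lines. This is an illustration only: the function names, the example z-scores, and the weights are hypothetical, and published source weights are not reproduced here.

```python
def r1(z_scores):
    """Unweighted mean of qualifying reviewers' z-scores."""
    return sum(z_scores) / len(z_scores)

def r2(z_scores, weights):
    """Source-weighted mean; weights[k] is the source weight of reviewer k."""
    return sum(w * z for w, z in zip(weights, z_scores)) / sum(weights)

def z_tilde(rating, mu_u, source_pooled_sd):
    """Imputed z-score for a constant rater, using the source's pooled SD."""
    return (rating - mu_u) / source_pooled_sd

def r3(z_scores, imputed_z_scores):
    """R1 over the broadened set: qualifying plus imputed constant raters."""
    combined = list(z_scores) + list(imputed_z_scores)
    return sum(combined) / len(combined)

# Hypothetical product with three qualifying reviewers and one constant rater.
z = [1.2, 0.8, 1.6]
w = [1.0, 2.0, 1.0]
constant = z_tilde(rating=4, mu_u=4, source_pooled_sd=0.9)

print(round(r1(z), 3), round(r2(z, w), 3), round(r3(z, [constant]), 3))
```

One thing the sketch makes visible: a constant rater's rating always equals their own mean, so under the formula as written their imputed z-score evaluates to 0 and they act as a neutral vote in R3.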

Source weights (R2 only)

Source weights are published per category. They do not affect R1 or R3. They represent our editorial judgment about which sources have historically produced the most decision-useful reviews — trained professional tasting panels, rigorous editorial gatekeeping, documented manipulation resistance. Weights live in our open-source repo and any change requires a public version bump with changelog. We do not accept payment to adjust source weights.

Step 3 — the 90% CI-floor

This is the step that matters most for what you see on a product page. A product with 4 reviewers averaging z = +2.1 should not outrank a product with 80 reviewers averaging z = +1.6. The thin-sample product might be exceptional, or its mean might be noise. The thick-sample product has paid its statistical dues.

So instead of ranking on the mean, we rank on the lower bound of the 90% confidence interval around the mean. That floor answers the question: "given this sample size, what's a defensibly pessimistic estimate of the product's true quality?"

For each aggregate (R1, R2, R3):

   Ẑ       = the aggregated mean z-score            (R1, R2, or R3)
   SE(Ẑ)   = 1 / √N_eff                             (standard error of the mean, taking z-scores as unit-variance)
   floor   = Ẑ  −  1.645 · SE(Ẑ)                    ← lower bound of the two-sided 90% CI (a one-sided 95% bound)

   N_eff  = effective sample size
          = N                                       (for R1 and R3)
          = ( Σ w_s )² / Σ w_s²                     (for R2; Kish effective size)

Worked example — two real-world shapes:

Product	N (reviewers)	Mean Ẑ	SE(Ẑ)	CI-floor (90%)	Rank by
Thin-sample darling	4	+2.10	0.500	+1.28	Mean = 2.10, Floor = 1.28
Well-reviewed consensus	80	+1.60	0.112	+1.42	Mean = 1.60, Floor = 1.42

On raw mean, the darling wins +2.10 to +1.60. On CI-floor — how we actually rank — the consensus product wins +1.42 to +1.28. This is by design. It's the same intuition Bayesian sports ratings, IMDb's Top 250 formula, and Wilson score confidence intervals (used by Reddit and Yelp internally) all use: penalize uncertainty, reward consistency.
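The worked example can be reproduced with a short sketch. The function names are hypothetical; the 1.645 multiplier, the SE formula, and the Kish effective size are taken from the formulas in this step.

```python
from math import sqrt

Z_CRIT = 1.645  # multiplier from the CI-floor formula

def kish_n_eff(weights):
    """Kish effective sample size for a weighted mean (used for R2)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)

def ci_floor(mean_z, n_eff, z_crit=Z_CRIT):
    """Lower bound: mean minus z_crit standard errors, with SE = 1 / sqrt(n_eff)."""
    return mean_z - z_crit / sqrt(n_eff)

darling = ci_floor(2.10, 4)      # thin-sample darling
consensus = ci_floor(1.60, 80)   # well-reviewed consensus

print(f"darling:   {darling:.2f}")
print(f"consensus: {consensus:.2f}")

# With equal weights the Kish size equals the reviewer count;
# unequal weights shrink it below the raw count.
print(kish_n_eff([1, 1, 1, 1]), round(kish_n_eff([1, 2, 1]), 3))
```

The two floors round to +1.28 and +1.42, matching the table, so the consensus product ranks first.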

Step 4 — convert the CI-floor to a 0–100 percentile

CI-floors are nice for math but not meaningful to a reader. So we rank every product's CI-floor against every other product's CI-floor in our database and express the result as a percentile using the empirical CDF:

p_R1_global(i)  =  100 · rank(  floor_R1(i)  )  /  N_total

similarly for p_R2_global, p_R3_global

  rank() = ordinal rank, ties split at midpoint
  N_total = number of products with a valid R1 CI-floor

A percentile of 90 means the product's R1 CI-floor is higher than that of 90% of the other products in the database. 50 is the median. 10 means the product's floor sits in the bottom 10% of the database.
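A sketch of the percentile step follows. It assumes that "ties split at midpoint" means tied products share the average of the 1-based ranks they jointly occupy; the function name and the sample floors are hypothetical.

```python
from bisect import bisect_left, bisect_right

def percentile_ranks(floors):
    """Map each CI-floor to 100 * rank / N, averaging ranks across ties."""
    n_total = len(floors)
    ordered = sorted(floors)
    result = []
    for value in floors:
        below = bisect_left(ordered, value)         # floors strictly lower
        at_or_below = bisect_right(ordered, value)  # floors lower or tied
        # Tied items occupy 1-based ranks below+1 .. at_or_below; take the midpoint.
        rank = (below + 1 + at_or_below) / 2
        result.append(100 * rank / n_total)
    return result

floors = [1.42, 1.28, 0.10, 1.28, -0.50]
percentiles = percentile_ranks(floors)
print(percentiles)
```

Under this convention the two products tied at 1.28 both receive the 70th percentile, the midpoint of the 60th and 80th positions they would otherwise split.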

Cohort percentiles — a re-ranking, not a re-computation

Global percentile answers "how does this product compare to everything we measure?" — which is useful but sometimes unfair. A $14 bottle that beats all other $14 bottles is doing exactly what a $14 bottle should do; showing it at the 25th global percentile (against $200 Burgundy) hides that achievement.

So we also publish a cohort percentile. It's the same CI-floor, re-ranked within a narrower peer group.

Cohort(i) = { j : category(j) = category(i)
                AND |price(j) − price(i)| / price(i) ≤ 0.20 }

p_R1_cohort(i) = 100 · rank(floor_R1(i)) / |Cohort(i)|
                  within Cohort(i)

No new math. Cohort uses the same R1 CI-floor computed globally; it's just ranked against fewer competitors. Same for R2 and R3. This keeps the pipeline fast, the storage cheap, and the output auditable. You can verify our cohort score by (a) looking up the product's raw CI-floor, (b) listing the cohort members, and (c) computing the rank yourself.

Category is the coarsest grouping ("wine," "bourbon," "single-origin coffee," etc.) and price is the list price at the time of most recent review. The ±20% band is symmetric: a $100 bottle's cohort is $80–$120.
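The three audit steps can be sketched as follows. The record layout, field names, and sample products are hypothetical, and the sketch assumes the product counts as a member of its own cohort and that tied floors share a midpoint rank.

```python
def cohort(products, target, band=0.20):
    """Same category, price within +/- band of the target's price."""
    return [
        p for p in products
        if p["category"] == target["category"]
        and abs(p["price"] - target["price"]) / target["price"] <= band
    ]

def cohort_percentile(products, target):
    """Re-rank the target's existing R1 CI-floor within its cohort."""
    members = cohort(products, target)
    floors = [p["floor_r1"] for p in members]
    below = sum(f < target["floor_r1"] for f in floors)
    tied = sum(f == target["floor_r1"] for f in floors)  # includes the target
    rank = below + (tied + 1) / 2                        # midpoint rank
    return 100 * rank / len(members)

products = [
    {"name": "A", "category": "wine", "price": 100, "floor_r1": 1.0},
    {"name": "B", "category": "wine", "price": 80, "floor_r1": 0.5},
    {"name": "C", "category": "wine", "price": 120, "floor_r1": 1.4},
    {"name": "D", "category": "wine", "price": 121, "floor_r1": 2.0},
    {"name": "E", "category": "bourbon", "price": 100, "floor_r1": 1.8},
]

members = [p["name"] for p in cohort(products, products[0])]
score = cohort_percentile(products, products[0])
print(members, round(score, 1))
```

Product D at $121 falls just outside the $80–$120 band and product E is in another category, so neither affects A's cohort score.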

Reading the spread — where taglines come from

R1, R2, and R3 are three views of the same product. When they agree, the product is simple to describe. When they diverge, the divergence itself is the story, and that's what our on-card taglines express.

Spread pattern	What it means	Tagline style
R1 high, R1-cohort much higher	Exceptional relative to its price peers; less dominant globally.	"Best in its price class; more moderate globally."
R1 high, R1-cohort lower	Strong overall but priced into a tough cohort.	"A strong global performer with fierce cohort competition."
R1 high, R2 low	Crowd-beloved; professional critics are cooler.	"Crowd-beloved; professional critics rate this lower."
R1 low, R2 much higher	Critics rate this far above the crowd consensus.	"Professional critics rate this far above the crowd consensus."
R3 > R1	Low-variance reviewers push it up — suggests broad consistency.	"Strong consensus even among less-discerning reviewers."
All three cluster tightly above 75	Universally well-regarded.	"Exceptional by every lens we apply."

The full tagline decision tree lives in lib/taglines.ts in the repository. Anyone can read it. No hidden editorial hand.

Where degrees of freedom enter

The Bessel-corrected n_u − 1 in σ_u is a degrees-of-freedom correction. A reviewer with n_u = 2 has df = 1, and their σ_u is very noisy, but they still enter the z-score at full weight. The CI-floor step is where this is properly penalized: reviewers with thin personal distributions produce noisier z, which increases the per-product variance, which widens the product's SE, which lowers its CI-floor. Thinness is penalized at the product level, not the reviewer level.

The full treatment, including N_eff derivations for R2's weighted case, is at /theory/degrees-of-freedom/. The CI-floor derivation and the choice of 90% (vs. 95% or 99%) are at /theory/confidence-intervals/.

Affiliate routing (fully disclosed)

Our "Buy" link for any product is selected by a published routing formula combining lowest observed price with highest affiliate commission across our authorized retailers. Every retailer price is shown openly on the page.

For each retailer r in authorized_retailers(product):
    score(r) = price_attractiveness(r) · commission_rate(r)
    price_attractiveness(r) = 1 − (price_r − min_price) / min_price

Primary Buy link = argmax_r score(r)
Secondary links = top-3 by raw price

Affiliate commissions never influence the normalized score. They affect which retailer we feature for "Buy." The score is deterministic given reviewer ratings, source weights, and published constants; you can fork the repository, plug in different weights, and reproduce or contest every number we publish.
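The routing formula translates directly into a short sketch. The offer records and retailer names are hypothetical, and "top-3 by raw price" is read here as the three lowest prices, which is an assumption.

```python
def route(offers):
    """Pick the primary Buy link and the secondary links from retailer offers."""
    min_price = min(o["price"] for o in offers)

    def score(offer):
        # price_attractiveness is 1.0 at the lowest price and falls as price rises.
        attractiveness = 1 - (offer["price"] - min_price) / min_price
        return attractiveness * offer["commission_rate"]

    primary = max(offers, key=score)
    secondary = sorted(offers, key=lambda o: o["price"])[:3]  # assumed: cheapest three
    return primary, secondary

offers = [
    {"retailer": "A", "price": 50, "commission_rate": 0.04},
    {"retailer": "B", "price": 52, "commission_rate": 0.08},
    {"retailer": "C", "price": 60, "commission_rate": 0.08},
    {"retailer": "D", "price": 49, "commission_rate": 0.02},
]

primary, secondary = route(offers)
print(primary["retailer"], [o["retailer"] for o in secondary])
```

In this example retailer B is featured even though D is cheapest, because B's higher commission outweighs its small price disadvantage; D still appears first among the secondary links.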

Reproducibility and open source

If you disagree with our output, you can check our work. Every step is published. The code is open. Fork it, improve it, cite it.
— Editorial standard

The reference normalization is published at https://github.com/rankquant, MIT-licensed, pip-installable. Every product page shows its per-product intermediates (R1, R2, R3, CI-floors, reviewer count, effective sample size) in full so you can audit any individual score without running the code.

Frequently asked questions

Why normalize per reviewer rather than per source?

Because reviewers, not publications, produce the scale. Two critics writing for the same publication can have wildly different personal ranges; a single critic reviewing both wine and bourbon uses one personal scale across both. Per-reviewer normalization is the finest grain that still has enough data — n_u ≥ 2 — to estimate a personal mean and standard deviation.

Why the minimum σ_u > 0, not σ_u > some small number?

Because any positive floor is arbitrary and noisy. A reviewer with σ_u = 0.2 is just as usable by the z-score formula as one with σ_u = 2. What we actually need to exclude is reviewers whose σ is mathematically undefined (n_u = 1) or zero (constant raters). Constant raters go into the R3 broadened lens with an imputed σ.

Why 90% confidence rather than 95%?

A consumer review site is a different context from a drug trial. At 95%, the CI-floor collapses toward zero for products with fewer than ~15 reviewers, and we'd lose too much signal. At 90%, the floor stays useful at N ≥ 6 while still penalizing thin samples substantially. The lower bound of a two-sided 90% interval uses z = 1.645; a two-sided 95% interval would use z = 1.960. The choice is a defaults-matter call, not a mathematical one, and it's published and version-stable.
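The two multipliers, and the size of the penalty each implies, can be checked with Python's standard library. This is a sketch: the helper name and the choice of sample sizes are illustrative, and SE = 1/√N is taken from Step 3.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, SD 1

# 1.645 is the 0.95 quantile of the standard normal (5% in each tail of a
# two-sided 90% interval); 1.960 is the 0.975 quantile (two-sided 95%).
z_90 = std_normal.inv_cdf(0.95)
z_95 = std_normal.inv_cdf(0.975)

def penalty(n, z_crit):
    """Amount subtracted from the mean z-score: z_crit * SE with SE = 1 / sqrt(n)."""
    return z_crit / sqrt(n)

print(f"z_90 = {z_90:.3f}, z_95 = {z_95:.3f}")
for n in (4, 6, 15, 80):
    print(f"N={n:>2}  90%: -{penalty(n, z_90):.3f}   95%: -{penalty(n, z_95):.3f}")
```

The table this prints shows how much more a thin sample loses at 95% than at 90%, which is the trade-off the answer above describes.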

Why make cohort a re-ranking rather than a separate computation?

Speed, storage, auditability. Re-ranking the same CI-floor within a cohort is O(k log k) per product; re-computing from scratch would require materializing a separate z-score table per cohort, which multiplies storage by the number of cohorts. The statistical information content is identical — a product's position within its cohort is fully determined by its CI-floor and its cohort members' CI-floors.

What does ±20% price do at the edges?

The symmetric ±20% band means a $100 product's cohort is $80–$120. For very cheap or very expensive products where the natural ±20% window produces thin cohorts, we fall back to the nearest-N rule: expand the band until the cohort contains at least 20 members. This edge-case rule is logged on the product page whenever invoked.
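One way the nearest-N fallback could work is sketched below. How the band is expanded is not specified above, so this sketch assumes it widens just far enough to reach the nearest products by relative price distance; the function name, record layout, and return shape are hypothetical.

```python
def cohort_with_fallback(products, target, band=0.20, min_members=20):
    """Return (members, band_used, used_fallback) for the target's cohort."""
    same_category = [p for p in products if p["category"] == target["category"]]

    def rel_dist(p):
        return abs(p["price"] - target["price"]) / target["price"]

    in_band = [p for p in same_category if rel_dist(p) <= band]
    if len(in_band) >= min_members:
        return in_band, band, False

    # Assumed expansion rule: widen to the distance of the min_members-th
    # nearest product (or the whole category if it is smaller than that).
    nearest = sorted(same_category, key=rel_dist)[:min_members]
    widened = max([band] + [rel_dist(p) for p in nearest])
    members = [p for p in same_category if rel_dist(p) <= widened]
    return members, widened, True

# Small illustration with min_members lowered to 3 so the fallback triggers.
products = [{"category": "wine", "price": price} for price in (100, 110, 150, 200, 300)]
members, band_used, used_fallback = cohort_with_fallback(
    products, products[0], min_members=3
)
print([p["price"] for p in members], band_used, used_fallback)
```

Returning the widened band and a flag alongside the members makes it straightforward to log the edge case whenever it is invoked.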

Can brands pay to rank higher?

No. Affiliate routing determines which retailer we link for "Buy" and which we feature in the secondary grid, but it does not influence R1, R2, R3, any CI-floor, or any percentile. The score is deterministic given the reviewer data and published constants.

Why include reviewers from other categories in the same reviewer pool?

A person's rating scale is a property of the person, not the category. If a reviewer rates wine in a 6-point range and bourbon in a 4-point range, combining them gives a better estimate of their actual personal standard deviation than either alone. Category-specific pools would throw away information and shrink sample sizes for cross-category reviewers. Cross-category pooling assumes the person's rating calibration is somewhat stable across product types — an assumption we believe holds in the data but continue to validate.

How often are scores updated?

When new reviews arrive the full pipeline re-runs. For products with active review volume that's a daily or weekly update. Historical scores remain visible so you can see how an evaluation has evolved. Every score on the site carries its computation date.

How is this different from IMDb's weighted average or Reddit's Wilson score?

It's the same family. IMDb's Top 250 uses a Bayesian shrinkage toward a global prior (our old v1 methodology used a similar step). Reddit's "best" sort uses a Wilson score lower bound on an up/down-vote binomial. We use a lower-bound-of-CI approach too — but on a per-reviewer normalized z-score rather than a raw vote or rating. The innovation isn't the CI-floor; it's applying it after per-reviewer normalization so we're ranking the right signal in the first place.

What if a product has no qualifying reviewers?

We don't publish a normalized score. The product appears in our database as "not rated" with an explanation and, where possible, a pointer to the source reviews. We do not fabricate a score from thin data — that would defeat the entire point.