Founding metrics — the five statistics primitives behind every Rankquant percentile
By Ryan Siegal · Founder and Principal
Why this page exists
The methodology page tells you what Rankquant does. The theory hub tells you why each of the four pipeline steps is rigorous. This page sits one level deeper — it tells you why the five underlying primitives are the right ingredients in the first place. Read it if you want to understand the methodology from first principles, debate the choice of estimator with us, or fork the open-source implementation and swap a primitive for one you think is better. Every primitive below is paired with the canonical statistical reference it traces back to.
1. The z-score — scale-free comparison across reviewers
The z-score is a re-expression of a value as the number of standard deviations it sits above or below the mean of its reference distribution:
z = ( x − μ ) / σ
For Rankquant: z_{u,i} = ( r_{u,i} − μ_u ) / σ_u
r_{u,i}: reviewer u's rating of product i
μ_u: reviewer u's personal mean across all their ratings
σ_u: reviewer u's personal Bessel-corrected SD (df = n_u − 1)
Why this is the right primitive for combining reviewers
Two reviewers who agree on relative quality but disagree on absolute scale produce different raw ratings of the same products but matching z-scores. The match is exact when one reviewer's scale is a shifted and stretched copy of the other's. A reviewer whose ratings span 90–96 and one whose ratings span 75–88 can both give their personal favorite a z-score of about +1.3. The z-score captures the only thing reviewers actually share: ordering and relative spread within their own rating universe.
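As a rough illustration of the scale-free property, the sketch below computes per-reviewer z-scores with the Bessel-corrected SD. The function name `reviewer_z_scores`, the product ids and the ratings are all hypothetical, and reviewer B's scale is assumed to be a shifted, stretched copy of reviewer A's.

```python
from statistics import mean, stdev


def reviewer_z_scores(ratings):
    """Map one reviewer's {product: rating} to {product: z-score}.

    Uses the reviewer's own mean and Bessel-corrected SD
    (statistics.stdev divides by n - 1).
    """
    values = list(ratings.values())
    if len(values) < 2:
        raise ValueError("need at least two ratings (n_u >= 2)")
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        raise ValueError("all ratings identical; z-score undefined")
    return {product: (r - mu) / sigma for product, r in ratings.items()}


# Hypothetical data: reviewer A rates in the 90-96 range.
reviewer_a = {"p1": 90, "p2": 92, "p3": 93, "p4": 96}
# Assumption: reviewer B's scale is b = 2 * a - 105, giving 75, 79, 81, 87.
reviewer_b = {p: 2 * r - 105 for p, r in reviewer_a.items()}

z_a = reviewer_z_scores(reviewer_a)
z_b = reviewer_z_scores(reviewer_b)

for product in reviewer_a:
    print(product, round(z_a[product], 3), round(z_b[product], 3))
```

The raw ratings differ for every product, yet both reviewers give `p4` a z-score of 1.3. Real reviewers will rarely be exact linear copies of one another, so in practice the z-scores agree only approximately.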
The standard reference is psychometrics: per-rater normalization is the foundational move in essay scoring (where graders use different personal scales) and in inter-laboratory measurement comparison (where instruments have different baselines). The same logic shows up in standardized testing (SAT/GRE/GMAT score scaling), grade curves at universities, and Olympic-style judging where each judge's scoring tendencies are corrected before medals are awarded.
Why we pool z-scores cross-category
A reviewer who rates 50 wines and 50 books has one personal scale, not two. People don't reset their generosity bias when they switch domains. We pool a reviewer's entire rating history to estimate μ_u and σ_u rather than computing a separate z-scale per category. The technical justification — and the n_u ≥ 2 admission rule — is detailed at /theory/degrees-of-freedom.
2. The standard error of the mean — why √n is everything
The standard error of a sample mean tells you how far the sample mean is likely to be from the population mean it is trying to estimate:
SE(x̄) = σ / √n
For z-scores (which have unit variance by construction): SE(Ẑ) = 1 / √n
This single equation is why a 4-reviewer product cannot fake its way above an 80-reviewer one. Quadrupling the sample size halves the SE. Going from 4 reviewers to 80 shrinks the SE by a factor of √(80/4) ≈ 4.47. The mean of 80 reviewers has a standard error 4.47 times smaller than the mean of 4, and any sensible ranking has to encode that.
A worked example
Product A: 4 reviewers, mean z = +2.1
SE(Ẑ_A) = 1/√4 = 0.500
Product B: 80 reviewers, mean z = +1.6
SE(Ẑ_B) = 1/√80 = 0.112
Product A has a HIGHER mean but about 4.5 TIMES the wobble (0.500 / 0.112 ≈ 4.47).
The wobble is what makes its rank uncertain.
The standard error is the foundation of every defensible aggregation method we know of, from clinical trials to election polling to FiveThirtyEight's sports models. It is also, fittingly, where the central limit theorem earns its rent: as n grows, the distribution of the sample mean converges on a normal distribution with standard deviation SE, which is what lets us compute confidence intervals without needing the population distribution itself to be normal.
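The arithmetic in the worked example can be checked with a few lines of Python. This is only a sketch: `se_mean_z` is a hypothetical helper name, and it assumes the unit-variance simplification SE = 1/√n stated above.

```python
import math


def se_mean_z(n):
    """Standard error of a mean of n z-scores, assuming unit variance."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / math.sqrt(n)


se_a = se_mean_z(4)    # Product A: 4 reviewers
se_b = se_mean_z(80)   # Product B: 80 reviewers
ratio = se_a / se_b    # equals sqrt(80 / 4)

print(f"SE_A = {se_a:.3f}, SE_B = {se_b:.3f}, ratio = {ratio:.2f}")
# prints: SE_A = 0.500, SE_B = 0.112, ratio = 4.47
```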
3. The 90% one-tailed confidence interval — what we publish
A confidence interval for an estimate Ẑ is a range of values that, with stated confidence, contains the true population value. Rankquant publishes only the lower end of that interval, the floor. With the 1.645 multiplier used here, it is:
floor = Ẑ − z_{0.05} · SE(Ẑ)
= Ẑ − 1.645 · SE(Ẑ)
The multiplier 1.645 is the standard-normal quantile that leaves 5% in one tail, so this floor is the lower endpoint of a two-sided 90% interval (equivalently, a one-sided 95% lower bound). A one-tailed bound at exactly 90% confidence would use 1.282 instead.
The CI-floor is the answer to the question: given the data we've observed, what is a defensibly pessimistic estimate of this product's true normalized score? Rankquant ranks products on the CI-floor, not on the mean. Products with high means but few reviewers have wide intervals and low floors; products with slightly lower means but many reviewers have tight intervals and higher floors. The full derivation, including the choice of 90% over 95%, is at /theory/confidence-intervals.
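To see the floor reorder two products, the sketch below applies the formula to Product A and Product B from the worked example in section 2. The helper name `ci_floor` is hypothetical, and the code assumes SE = 1/√n and the 1.645 multiplier quoted above.

```python
import math
from statistics import NormalDist


def ci_floor(mean_z, n, multiplier=1.645):
    """Pessimistic estimate: mean z minus multiplier times SE = 1/sqrt(n)."""
    return mean_z - multiplier / math.sqrt(n)


floor_a = ci_floor(2.1, 4)    # Product A: higher mean, 4 reviewers
floor_b = ci_floor(1.6, 80)   # Product B: lower mean, 80 reviewers

print(f"floor_A = {floor_a:.3f}, floor_B = {floor_b:.3f}")

# For reference: 1.645 is approximately the 0.95 standard-normal quantile,
# while the 0.90 quantile is approximately 1.282.
print(round(NormalDist().inv_cdf(0.95), 3), round(NormalDist().inv_cdf(0.90), 3))
```

Product A starts with the higher mean (2.1 versus 1.6) but ends with the lower floor (about 1.28 versus about 1.42), so Product B ranks above it.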
| Related method | Where it appears and how it relates |
|---|---|
| Wilson score interval (binary outcomes) | Reddit's "best comments" sort and Yelp's internal ranking. Same idea applied to up/down vote binomials. Edwin Wilson, 1927. |
| Bayesian shrinkage (Beta-Binomial / Normal-Normal) | IMDb's Top 250 formula and credibility theory in actuarial science. Pulls thin-sample estimates toward a prior. Mathematically dual to a CI-floor under standard assumptions. |
| Hodges–Lehmann estimator | Rank-based estimate of a location parameter. Robust to non-normality; influences our empirical-CDF step. |
| DerSimonian–Laird random-effects estimator | Standard meta-analysis aggregation that combines within-study and between-study variance. A future Rankquant methodology upgrade. |
4. The empirical CDF — distribution-free percentiles
The empirical cumulative distribution function is the data-driven estimate of the true CDF. Given n observations, it is:
F̂(x) = ( 1 / n ) · Σ I( X_i ≤ x )
For Rankquant: percentile(P) = F̂(floor_P) · 100
Each product's percentile is the fraction of all products whose CI-floor is at or below it, scaled to 0–100. There is no parametric assumption about the distribution of CI-floors. This matters because the population of CI-floors is not normal: in practice it's right-skewed (a few exceptional products) and bounded above (no product can score better than "everyone agrees this is exceptional").
The empirical CDF is, by the Glivenko–Cantelli theorem, a uniformly consistent estimator of the true CDF — meaning as our database grows, the percentiles we publish converge on what they would be against the full population of products in each category. It is the foundation of nonparametric statistics and the reason rank-based methods are robust where parametric ones break.
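A minimal sketch of the percentile mapping follows, assuming the CI-floors have already been computed. The function name `ecdf_percentiles` and the floor values are hypothetical. Ties share a percentile because the indicator uses "at or below" (≤).

```python
from bisect import bisect_right


def ecdf_percentiles(floors):
    """Map {product: ci_floor} to {product: percentile in (0, 100]}.

    A product's percentile is the share of all products whose floor is
    at or below its own, i.e. 100 * F_hat(floor).
    """
    ordered = sorted(floors.values())
    n = len(ordered)
    return {p: 100.0 * bisect_right(ordered, x) / n for p, x in floors.items()}


# Hypothetical CI-floors for five products; C and E are tied.
floors = {"A": 1.28, "B": 1.42, "C": 0.10, "D": -0.35, "E": 0.10}
pct = ecdf_percentiles(floors)
print(pct)
# prints: {'A': 80.0, 'B': 100.0, 'C': 60.0, 'D': 20.0, 'E': 60.0}
```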
Why not the normal CDF (Φ)?
We could compute percentiles via p = Φ(Ẑ), feeding the mean z-score through the standard-normal CDF. We don't, because that assumes the population of product means is normal. Our data shows it isn't, and assuming normality where it doesn't hold introduces bias we can't justify. The empirical CDF makes no such assumption.
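The gap between the two mappings is easy to see on a skewed sample. The scores below are hypothetical and deliberately right-skewed (many near-average products and one standout), and the helper names are illustrative only.

```python
from bisect import bisect_right
from statistics import NormalDist

# Hypothetical, right-skewed scores: most products cluster low, one stands out.
scores = [-0.6, -0.5, -0.4, -0.3, -0.2, 0.0, 0.3, 2.5]
ordered = sorted(scores)


def empirical_pct(x):
    """Percentile from the empirical CDF of the observed scores."""
    return 100.0 * bisect_right(ordered, x) / len(ordered)


def normal_pct(x):
    """Percentile assuming the scores follow a standard normal."""
    return 100.0 * NormalDist().cdf(x)


for x in (-0.2, 0.0, 2.5):
    print(f"score {x:+.1f}: empirical {empirical_pct(x):5.1f}, "
          f"normal {normal_pct(x):5.1f}")
```

A score of 0.0 sits at the 50th percentile under the normal assumption but at the 75th percentile of this sample, because six of the eight products score at or below it.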
5. The Kish design effect — weighting without overcounting
When Rankquant's R2 lens applies unequal source weights w_s, the raw sample size N_raw overstates the precision of the weighted mean. The right count is the effective sample size, given by Kish's design effect:
N_eff = ( Σ w_s )² / Σ w_s²
Equality (N_eff = N_raw) only when every w_s is equal.
Otherwise N_eff < N_raw, sometimes much less.
A weighted mean of 100 reviewers where 90 carry weight 1 and 10 carry weight 10 has N_eff = (190)² / (90·1 + 10·100) = 36,100 / 1,090 ≈ 33. It is statistically a 33-reviewer mean, even though it's nominally a 100-reviewer mean. The correction shows up in the SE that feeds the CI-floor at step 3.
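The 100-reviewer example can be reproduced with the sketch below. The function name `kish_effective_n` is hypothetical, and the SE lines assume the unit-variance simplification from section 2 with N_eff substituted for n.

```python
import math


def kish_effective_n(weights):
    """Kish effective sample size: (sum of w)^2 / (sum of w^2)."""
    total = sum(weights)
    sum_sq = sum(w * w for w in weights)
    if sum_sq == 0:
        raise ValueError("at least one weight must be non-zero")
    return total * total / sum_sq


# 90 reviewers at weight 1 and 10 reviewers at weight 10.
weights = [1.0] * 90 + [10.0] * 10
n_eff = kish_effective_n(weights)          # 36100 / 1090

se_naive = 1.0 / math.sqrt(len(weights))   # treats all 100 as independent equals
se_weighted = 1.0 / math.sqrt(n_eff)       # uses the effective sample size

print(f"N_eff = {n_eff:.1f}, naive SE = {se_naive:.3f}, "
      f"corrected SE = {se_weighted:.3f}")
# prints: N_eff = 33.1, naive SE = 0.100, corrected SE = 0.174
```

With equal weights the function returns the raw count, for example `kish_effective_n([2.0] * 7)` gives 7.0.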
The Kish design effect originates in survey statistics — Kish (1965), Survey Sampling — and is the standard correction in any weighted-aggregation context (epidemiology, polling, meta-analysis). Without it, source-weighted lenses would produce optimistically narrow CIs and over-rank products with reviewer mixes that lean heavily on a few high-weight sources.
How the five primitives compose into the pipeline
| Primitive | Role and reference |
|---|---|
| 1. Z-score | Per-reviewer normalization. Removes scale bias before aggregation. Reference: psychometrics, e.g. Cronbach (1951); SAT/GRE score scaling. |
| 2. Standard error | Quantifies sampling uncertainty in each aggregate. SE(Ẑ) = 1/√N for z-scores. Reference: central limit theorem; any first-year statistics text. |
| 3. 90% CI-floor | Defensibly pessimistic point estimate. Ẑ − 1.645 · SE. Same family as Wilson score and IMDb-style Bayesian shrinkage. |
| 4. Empirical CDF | Distribution-free percentile mapping. F̂(x) = (1/n) Σ I(X_i ≤ x). Reference: Glivenko–Cantelli (1933); nonparametric statistics. |
| 5. Kish design effect | Effective sample size under unequal weights. N_eff = (Σw)² / Σw². Reference: Kish (1965). |
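Put together, the unweighted path through the table might look like the sketch below. It is not the open-source implementation: the function name `rank_products`, the tuple layout and the sample ratings are assumptions made for illustration, and the Kish-corrected source weighting is left out.

```python
import math
from bisect import bisect_right
from collections import defaultdict
from statistics import mean, stdev


def rank_products(ratings, multiplier=1.645):
    """ratings: list of (reviewer, product, rating) -> {product: percentile}."""
    # Step 1: per-reviewer scale (mean, Bessel-corrected SD).
    by_reviewer = defaultdict(list)
    for reviewer, _, rating in ratings:
        by_reviewer[reviewer].append(rating)
    scale = {}
    for reviewer, values in by_reviewer.items():
        # Skip reviewers with fewer than two ratings or zero spread.
        if len(values) >= 2 and stdev(values) > 0:
            scale[reviewer] = (mean(values), stdev(values))

    # Collect z-scores per product from admitted reviewers only.
    z_by_product = defaultdict(list)
    for reviewer, product, rating in ratings:
        if reviewer in scale:
            mu, sigma = scale[reviewer]
            z_by_product[product].append((rating - mu) / sigma)

    # Steps 2 and 3: mean z, SE = 1/sqrt(n), CI-floor.
    floors = {
        product: mean(zs) - multiplier / math.sqrt(len(zs))
        for product, zs in z_by_product.items()
    }

    # Step 4: empirical-CDF percentile of each floor.
    ordered = sorted(floors.values())
    return {
        product: 100.0 * bisect_right(ordered, floor) / len(ordered)
        for product, floor in floors.items()
    }


# Hypothetical ratings: three reviewers on very different scales, plus one
# single-rating reviewer (u4) who is excluded from the calculation.
sample = [
    ("u1", "p1", 9), ("u1", "p2", 7), ("u1", "p3", 5),
    ("u2", "p1", 96), ("u2", "p2", 90), ("u2", "p3", 93),
    ("u3", "p1", 4), ("u3", "p2", 3), ("u3", "p3", 2),
    ("u4", "p3", 10),
]
result = rank_products(sample)
print(result)
```

In this toy data every admitted reviewer places `p1` one standard deviation above their own mean, so `p1` lands at the 100th percentile even though its raw ratings (9, 96 and 4) look nothing alike.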
What is novel here
None of these primitives is novel. Combining them as Rankquant does — per-reviewer z-score → CI-floor → empirical-CDF percentile, with optional Kish-corrected source weighting — is a textbook composition. What is novel is publishing it as the basis of a consumer review site. Every other review surface we are aware of publishes either a raw average or a weakly-shrunken average and stops. Rankquant publishes the math.
If you disagree with any of the five primitives, the open-source implementation at https://github.com/rankquant is configured to let you swap them out. Replace the empirical CDF with the normal CDF; replace the 90% CI-floor with a 95%; replace the per-reviewer z-score with a per-source z-score. Run the pipeline on the same data. Compare the rankings. The methodology is meant to be argued with — that is the whole point of publishing it.
Frequently asked questions
Why frequentist confidence intervals instead of Bayesian credible intervals?
Why one-tailed instead of two-tailed?
Why pool reviewers cross-category for the z-score?
What about ordinal-vs-interval scale concerns?
What's the smallest N where this all works?
Where do these primitives break down?