Founding metrics — the five statistics primitives behind every Rankquant percentile

By Ryan Siegal · Founder and Principal

Published 2026-04-27

Short answer

Rankquant rests on five textbook statistics primitives, each chosen because it is the right tool for a specific job. The z-score strips out per-reviewer scale bias by re-expressing every rating in standard-deviation units relative to that reviewer's own distribution. The standard error of the mean (SE = σ / √n) tells us how much the sample mean can wobble around the true mean, and it shrinks in proportion to 1/√n. The 90% one-tailed confidence interval gives us a defensibly pessimistic estimate of a product's true normalized score, punishing thin samples without erasing them. The empirical cumulative distribution function turns CI-floors into a calibrated 0–100 percentile without assuming any parametric distribution. The Kish design effect handles unequal-credibility reviewers without overstating sample size. Each is taught in any first-year statistics course; what is new is applying their composition to consumer reviews.

Why this page exists

The methodology page tells you what Rankquant does. The theory hub tells you why each of the four pipeline steps is rigorous. This page sits one level deeper — it tells you why the five underlying primitives are the right ingredients in the first place. Read it if you want to understand the methodology from first principles, debate the choice of estimator with us, or fork the open-source implementation and swap a primitive for one you think is better. Every primitive below is paired with the canonical statistical reference it traces back to.

1. The z-score — scale-free comparison across reviewers

The z-score is a re-expression of a value as the number of standard deviations it sits above or below the mean of its reference distribution:

z = ( x − μ ) / σ

For Rankquant: z_{u,i}  =  ( r_{u,i}  −  μ_u )  /  σ_u

  r_{u,i}  reviewer u's rating of product i
  μ_u      reviewer u's personal mean across all their ratings
  σ_u      reviewer u's personal Bessel-corrected SD (df = n_u − 1)

Why this is the right primitive for combining reviewers

Two reviewers who agree on relative quality but differ in the offset and stretch of their personal scales give numerically different raw ratings to the same product but identical z-scores. A reviewer whose ratings run 90–96 and a reviewer whose ratings run 75–88 can both land near z ≈ +1.3 for their personal favorites. The z-score captures the only thing reviewers actually share: ordering and relative spread within their own rating universe.
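A minimal Python sketch of the per-reviewer normalization, for readers who want to see it run. The function name reviewer_z_scores and both rating lists are hypothetical illustrations, not taken from the Rankquant codebase; the second scale is built as a stretched and shifted copy of the first so that the two reviewers agree on relative quality exactly.

```python
from statistics import mean, stdev


def reviewer_z_scores(ratings):
    """Re-express one reviewer's ratings in units of their own sample SD.

    statistics.stdev is the Bessel-corrected SD (df = n - 1).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate a spread")
    mu = mean(ratings)
    sigma = stdev(ratings)
    if sigma == 0:
        raise ValueError("identical ratings: z-score is undefined")
    return [(r - mu) / sigma for r in ratings]


# Hypothetical data: the same five products on two personal scales.
generous = [90, 91, 93, 94, 96]            # compressed, high range
harsh = [2 * r - 105 for r in generous]    # 75, 77, 81, 83, 87

z_generous = reviewer_z_scores(generous)
z_harsh = reviewer_z_scores(harsh)

print([round(z, 2) for z in z_generous])
print([round(z, 2) for z in z_harsh])
```

Both lists print the same z-scores, with the top-rated product at roughly +1.34 on either scale.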

The standard reference is psychometrics: per-rater normalization is the foundational move in essay scoring (where graders use different personal scales) and in inter-laboratory measurement comparison (where instruments have different baselines). The same logic shows up in standardized testing (SAT/GRE/GMAT score scaling), grade curves at universities, and Olympic-style judging where each judge's scoring tendencies are corrected before medals are awarded.

Why we pool z-scores cross-category

A reviewer who rates 50 wines and 50 books has one personal scale, not two. People don't reset their generosity bias when they switch domains. We pool a reviewer's entire rating history to estimate μ_u and σ_u rather than computing a separate z-scale per category. The technical justification — and the n_u ≥ 2 admission rule — is detailed at /theory/degrees-of-freedom.

2. The standard error of the mean — why √n is everything

The standard error of a sample mean tells you how far the sample mean is likely to be from the population mean it is trying to estimate:

SE(x̄)  =  σ / √n

For z-scores (which have unit variance by construction): SE(Ẑ)  =  1 / √n

This single equation is why a 4-reviewer product cannot fake-rank an 80-reviewer one. Quadrupling the sample size halves the SE. Going from 4 reviewers to 80 reviewers shrinks the SE by a factor of √(80/4) ≈ 4.47×. The mean of 80 reviewers is 4.47 times more precise than the mean of 4 — and any sensible ranking has to encode that.

A worked example

Product A:  4 reviewers, mean z = +2.1
   SE(Ẑ_A)  =  1/√4   =  0.500

Product B: 80 reviewers, mean z = +1.6
   SE(Ẑ_B)  =  1/√80  =  0.112

Product A has a HIGHER mean but about 4.5 TIMES the wobble.
The wobble is what makes its rank uncertain.
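The arithmetic in the worked example can be reproduced with a short Python sketch. The helper name se_of_mean_z is a hypothetical label, and the unit-variance assumption is the one stated in the text.

```python
import math


def se_of_mean_z(n):
    """Standard error of the mean of n z-scores, assuming unit variance."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / math.sqrt(n)


se_a = se_of_mean_z(4)     # Product A: 4 reviewers
se_b = se_of_mean_z(80)    # Product B: 80 reviewers
ratio = se_a / se_b        # equals sqrt(80 / 4)

print(f"SE_A = {se_a:.3f}, SE_B = {se_b:.3f}, ratio = {ratio:.2f}")
```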

The standard error is the foundation of every defensible aggregation method we know of, from clinical trials to election polling to FiveThirtyEight's sports models. It is also, fittingly, where the central limit theorem earns its rent: as n grows, the distribution of the sample mean converges on a normal distribution with standard deviation SE — which is what lets us compute confidence intervals without needing the population distribution itself to be normal.

3. The 90% one-tailed confidence interval — what we publish

A confidence interval for an estimate Ẑ is a range of values that, with stated confidence, contains the true population value. The one-tailed floor of a 90% CI is:

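floor  =  Ẑ  −  z_{0.05}  ·  SE(Ẑ)
       =  Ẑ  −  1.645 · SE(Ẑ)

The multiplier 1.645 is the standard-normal critical value that leaves 5% in the lower tail, so this floor is the lower limit of a two-sided 90% interval. Taken on its own as a one-sided bound it carries 95% confidence; a one-sided bound at exactly 90% would use 1.282 instead.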

The CI-floor is the answer to the question: given the data we've observed, what is a defensibly pessimistic estimate of this product's true normalized score? Rankquant ranks products on the CI-floor, not on the mean. Products with high means but few reviewers have wide intervals and low floors; products with slightly lower means but many reviewers have tight intervals and higher floors. The full derivation, including the choice of 90% over 95%, is at /theory/confidence-intervals.
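To make the ranking consequence concrete, here is a small Python sketch that applies the floor formula to the two products from the standard-error example (4 reviewers at mean z = +2.1, 80 reviewers at mean z = +1.6). The function name ci_floor is hypothetical, and the multiplier is simply the 1.645 quoted in the formula.

```python
import math

Z_CRIT = 1.645  # multiplier quoted in the floor formula


def ci_floor(mean_z, n, z_crit=Z_CRIT):
    """Mean z-score minus z_crit standard errors, with SE = 1 / sqrt(n)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return mean_z - z_crit / math.sqrt(n)


floor_a = ci_floor(2.1, 4)     # high mean, thin sample
floor_b = ci_floor(1.6, 80)    # lower mean, deep sample

print(f"floor_A = {floor_a:.3f}, floor_B = {floor_b:.3f}")
```

Under these assumptions Product B's floor (about 1.42) sits above Product A's (about 1.28), so B ranks higher despite its lower mean.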

The CI-floor sits in a well-known family of small-sample-safe ranking primitives.
Wilson score interval (binary outcomes)	Reddit's "best comments" sort and Yelp's internal ranking. Same idea applied to up/down vote binomials. Edwin Wilson, 1927.
Bayesian shrinkage (Beta-Binomial / Normal-Normal)	IMDb's Top 250 formula and credibility theory in actuarial science. Pulls thin-sample estimates toward a prior. Mathematically dual to a CI-floor under standard assumptions.
Hodges–Lehmann estimator	Rank-based estimate of a location parameter. Robust to non-normality; influences our empirical-CDF step.
DerSimonian–Laird random-effects estimator	Standard meta-analysis aggregation that combines within-study and between-study variance. A future Rankquant methodology upgrade.

4. The empirical CDF — distribution-free percentiles

The empirical cumulative distribution function is the data-driven estimate of the true CDF. Given n observations, it is:

F̂(x)  =  ( 1 / n )  ·  Σ  I( X_i ≤ x )

For Rankquant:  percentile(P)  =  F̂(floor_P)  ·  100

Each product's percentile is the fraction of all products whose CI-floor is at or below it, scaled to 0–100. There is no parametric assumption about the distribution of CI-floors. This matters because the population of CI-floors is not normal: in practice it's right-skewed (a few exceptional products) and bounded above (no product can score better than "everyone agrees this is exceptional").
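The percentile mapping is short enough to sketch directly in Python. The function name ecdf_percentiles and the sample floors are hypothetical; the "at or below" convention follows the definition of F̂ given here, so tied floors share a percentile.

```python
from bisect import bisect_right


def ecdf_percentiles(floors):
    """Map each CI-floor x to 100 * F_hat(x), where F_hat counts values <= x."""
    ordered = sorted(floors)
    n = len(ordered)
    # bisect_right returns how many sorted values are <= x.
    return [100.0 * bisect_right(ordered, x) / n for x in floors]


# Hypothetical CI-floors for five products; two of them are tied.
floors = [1.42, 1.28, 0.35, -0.10, 0.35]
percentiles = ecdf_percentiles(floors)

print(percentiles)  # [100.0, 80.0, 60.0, 20.0, 60.0]
```

No distributional shape is assumed anywhere: only the ordering of the floors enters the calculation.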

The empirical CDF is, by the Glivenko–Cantelli theorem, a uniformly consistent estimator of the true CDF — meaning as our database grows, the percentiles we publish converge on what they would be against the full population of products in each category. It is the foundation of nonparametric statistics and the reason rank-based methods are robust where parametric ones break.

Why not the normal CDF (Φ)?

We could compute percentiles via p = Φ(Ẑ), feeding the mean z-score through the standard-normal CDF. We don't, because that assumes the population of product means is normal. Our data shows it isn't, and assuming normality where it doesn't hold introduces bias we can't justify. The empirical CDF makes no such assumption.

5. The Kish design effect — weighting without overcounting

When Rankquant's R2 lens applies unequal source weights w_s, the simple sample size N overstates the precision of the weighted mean. The right count is the effective sample size, given by Kish's design effect:

N_eff  =  ( Σ w_s )²  /  Σ w_s²

  Equality (N_eff = N_raw) only when every w_s is equal.
  Otherwise N_eff < N_raw — sometimes much less.

A weighted mean of 100 reviewers where 90 carry weight 1 and 10 carry weight 10 has N_eff = (190)² / (90·1 + 10·100) = 36,100 / 1,090 ≈ 33. It is statistically a 33-reviewer mean, even though it's nominally a 100-reviewer mean. The correction shows up in the SE that feeds the CI-floor at step 3.
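The same calculation as a Python sketch, using the 90-and-10 weight mix from the example. The function name kish_effective_n is hypothetical, and plugging N_eff into SE = 1/√N is an illustration of how the correction would widen the interval, under the unit-variance assumption used for z-scores.

```python
import math


def kish_effective_n(weights):
    """Kish effective sample size: (sum of w)^2 / (sum of w^2)."""
    total = sum(weights)
    sum_sq = sum(w * w for w in weights)
    if sum_sq == 0:
        raise ValueError("at least one weight must be non-zero")
    return total * total / sum_sq


weights = [1.0] * 90 + [10.0] * 10        # the mix from the example
n_eff = kish_effective_n(weights)

se_naive = 1.0 / math.sqrt(len(weights))  # treats all 100 as equal
se_kish = 1.0 / math.sqrt(n_eff)          # uses the effective count

print(f"N_eff = {n_eff:.1f}, SE naive = {se_naive:.3f}, SE Kish = {se_kish:.3f}")
```

With these weights the corrected SE is about 0.174 rather than 0.100, which is the widening that keeps source-weighted floors honest.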

The Kish design effect originates in survey statistics — Kish (1965), Survey Sampling — and is the standard correction in any weighted-aggregation context (epidemiology, polling, meta-analysis). Without it, source-weighted lenses would produce optimistically narrow CIs and over-rank products with reviewer mixes that lean heavily on a few high-weight sources.

How the five primitives compose into the pipeline

The five primitives, their pipeline role, and their canonical references.
1. Z-score	Per-reviewer normalization. Removes scale bias before aggregation. Reference: psychometrics, e.g. Cronbach (1951); SAT/GRE score scaling.
2. Standard error	Quantifies sampling uncertainty in each aggregate. SE(Ẑ) = 1/√N for z-scores. Reference: central limit theorem; any first-year statistics text.
3. 90% CI-floor	Defensibly pessimistic point estimate. Ẑ − 1.645 · SE. Same family as Wilson score and IMDb-style Bayesian shrinkage.
4. Empirical CDF	Distribution-free percentile mapping. F̂(x) = (1/n) Σ I(X_i ≤ x). Reference: Glivenko–Cantelli (1933); nonparametric statistics.
5. Kish design effect	Effective sample size under unequal weights. N_eff = (Σw)² / Σw². Reference: Kish (1965).

What is novel here

None of these primitives is novel. Combining them as Rankquant does — per-reviewer z-score → CI-floor → empirical-CDF percentile, with optional Kish-corrected source weighting — is a textbook composition. What is novel is publishing it as the basis of a consumer review site. Every other review surface we are aware of publishes either a raw average or a weakly-shrunken average and stops. Rankquant publishes the math.
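For readers who want to see the unweighted composition end to end, here is a compact Python sketch. It is not the open-source implementation: the function name rank_products, the input layout, and the sample ratings are all hypothetical, and Kish weighting is left out for brevity.

```python
import math
from bisect import bisect_right
from collections import defaultdict
from statistics import mean, stdev

Z_CRIT = 1.645  # CI-floor multiplier quoted on this page


def rank_products(ratings_by_reviewer):
    """Map {reviewer: {product: raw_rating}} to {product: percentile}."""
    z_by_product = defaultdict(list)
    for ratings in ratings_by_reviewer.values():
        values = list(ratings.values())
        if len(values) < 2:
            continue                  # n_u >= 2 admission rule
        sigma = stdev(values)         # Bessel-corrected SD
        if sigma == 0:
            continue                  # z-score undefined without spread
        mu = mean(values)
        for product, rating in ratings.items():
            z_by_product[product].append((rating - mu) / sigma)

    # CI-floor per product: mean z minus Z_CRIT standard errors.
    floors = {
        product: mean(zs) - Z_CRIT / math.sqrt(len(zs))
        for product, zs in z_by_product.items()
    }

    # Empirical-CDF percentile of each floor among all floors.
    ordered = sorted(floors.values())
    n = len(ordered)
    return {p: 100.0 * bisect_right(ordered, f) / n for p, f in floors.items()}


# Hypothetical reviewers on three different scales who agree on the ordering.
sample = {
    "u1": {"A": 96, "B": 92, "C": 90},
    "u2": {"A": 88, "B": 80, "C": 75},
    "u3": {"A": 4.5, "B": 4.0, "C": 3.0},
}
result = rank_products(sample)
print({p: round(v, 1) for p, v in sorted(result.items())})
```

Swapping a primitive in this sketch means replacing one step: a different multiplier for Z_CRIT, or a different mapping in the final line of rank_products.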

If you disagree with any of the five primitives, the open-source implementation at https://github.com/rankquant is configured to let you swap them out. Replace the empirical CDF with the normal CDF; replace the 90% CI-floor with a 95%; replace the per-reviewer z-score with a per-source z-score. Run the pipeline on the same data. Compare the rankings. The methodology is meant to be argued with — that is the whole point of publishing it.

Frequently asked questions

Why frequentist confidence intervals instead of Bayesian credible intervals?

Both are defensible. Frequentist CI-floors don't require specifying a prior strength k, are scale-free across z-scored data, and are easier for a user to reproduce by hand. Bayesian shrinkage is a future methodology upgrade we may publish as an alternative R-lens; under standard noninformative priors the rankings are nearly identical.

Why one-tailed instead of two-tailed?

We are asking a one-tailed question: "what is a defensibly low estimate of this product's quality?" The upper bound is irrelevant for ranking — no product gets promoted by having a lucky ceiling. Two-tailed CIs would over-penalize products with upside uncertainty.

Why pool reviewers cross-category for the z-score?

A reviewer's personal scale is a property of the reviewer, not of the category. Someone who rates everything 90+ on a 100-point wine scale also rates everything 4.5+ on Goodreads. Pooling cross-category increases the sample size for estimating μ_u and σ_u, which in turn makes the z-score itself less noisy.

What about ordinal-vs-interval scale concerns?

Strictly speaking, consumer ratings are ordinal: the gap between 4 and 5 stars is not necessarily the same as between 3 and 4. Treating them as interval is a common simplifying assumption that's defensible given the regularity of rating distributions across millions of users. The empirical-CDF step at the end recovers a rank-based interpretation of the final score: the percentile depends only on the ordering of the CI-floors, so it is unchanged by any strictly increasing transformation of them.

What's the smallest N where this all works?

Per-reviewer z-score requires n_u ≥ 2 with σ_u > 0. The CI-floor numerically works at any N, though SE = 1/√N goes to 1.0 at N = 1 and the floor sits about 1.6 below the mean. Empirical-CDF percentiles need a populated database — at least ~30 products in the same comparison group for the percentile to be stable. We declare a category "ready" once the thinnest cohort hits ~30 products.

Where do these primitives break down?

Three places. (1) When the z-score assumption that reviewers have monotone preferences is violated — a reviewer who likes both red and white wines but for different reasons may have multimodal preferences not captured by μ_u, σ_u. (2) When a category is too narrow for the empirical CDF to converge — sub-50-product categories produce noisy percentiles. (3) When the Kish design effect fails because weights are correlated with z-scores (high-weight sources also tend to rate products higher); we publish the source weights specifically to make this checkable.
