Degrees of freedom in the Rankquant pipeline

Q: Why does sample variance divide by n−1 instead of n?

Because the sample mean x̄ is estimated from the same data. When you use the sample mean as your center, the deviations (x_i − x̄) are constrained to sum to zero, so only n−1 of them vary independently. Dividing by n−1 (Bessel's correction) gives an unbiased estimator; dividing by n underestimates true variance.

Q: Is n_u = 2 really enough to normalize a reviewer?

Just barely. With n_u = 2 a reviewer's σ_u has df = 1 and is extremely noisy. We admit them because (a) excluding them would throw away a lot of crowd data, and (b) the noise they introduce is correctly booked at the product-level CI-floor through the 1/√N_eff scaling. Reviewers with larger n_u contribute more stable z-scores, which is exactly what we want.

Q: What happens to reviewers with only one review?

They're on file but excluded from R1, R2, and R3. A single-review reviewer has no personal distribution — μ_u and σ_u are undefined — so there's no z-score to contribute. Their rating is still visible on individual product pages as raw data.

Q: What's the difference between N_raw and N_eff?

N_raw is the count of qualifying reviewers for a product. N_eff accounts for non-uniform weighting via the Kish formula N_eff = (Σw)² / Σw². For R1 and R3, weights are uniform so N_eff = N_raw. For R2 (source-weighted), N_eff ≤ N_raw with equality only if all weights are equal; the ratio tells you how much R2's weighting collapses the effective sample size.

Q: Does R3's imputed σ̃_s bias the ranking?

Slightly, and in a known direction: if a constant-rater would, counterfactually, have used tighter-than-typical variance, their imputed z-score overstates extremity; if they would have used wider variance, it understates. Over many products the bias averages toward zero under a reasonable model of rater calibration. R3 is a secondary lens — we surface it but never make it the primary number.

Q: Does pooling wine and bourbon reviewers into one pool violate df logic?

No. Pooling is a modeling assumption, not a df cheat. We're assuming the same reviewer uses roughly the same personal scale across categories — an assumption that loses a little per-category resolution in exchange for much better σ_u estimates (n_u is larger). We validate the assumption empirically by checking that within-reviewer rating dispersion is similar across categories for cross-category reviewers.

By Ryan Siegal · Founder and Principal

Published 2026-04-24

1. The classical definition

Degrees of freedom count the values in a sample that remain free to vary once the constraints imposed on the sample are accounted for: the number of observations minus the number of independent constraints. For a sample x₁, x₂, …, x_n with computed sample mean x̄, the deviations (x_i − x̄) sum to zero:

Σ (x_i − x̄) = 0                  ← one linear constraint
therefore exactly (n − 1) of the (x_i − x̄) are free to vary;
the nth is determined by the constraint.

That constraint is why sample variance divides by n − 1 rather than n. Dividing by n produces an estimator biased low; dividing by n − 1 (Bessel's correction) yields an unbiased estimator of the population variance:

σ̂²  =  (1 / (n − 1)) · Σ (x_i − x̄)²      ← Bessel's correction, E[σ̂²] = σ²

The estimated variance of a sample underestimates the true variance of the population it was drawn from. The correction factor is n/(n−1).
— Bessel, Astronomical notes, 1830s (paraphrased)
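
The bias can be checked exactly rather than by simulation. The sketch below is illustrative only: the population (a fair six-sided die) and the sample size of 3 are assumptions chosen so that every possible sample can be enumerated. It averages both variance estimators over all equally likely samples drawn with replacement.

```python
from fractions import Fraction
from itertools import product

def variance(sample, ddof):
    """Sum of squared deviations from the sample mean, divided by (n - ddof)."""
    n = len(sample)
    mean = Fraction(sum(sample), n)
    ss = sum((Fraction(x) - mean) ** 2 for x in sample)
    return ss / (n - ddof)

population = [1, 2, 3, 4, 5, 6]  # hypothetical population: a fair die
n = 3                            # hypothetical sample size

# True population variance: the mean is known exactly, so divide by the size.
true_var = variance(population, ddof=0)

# Every equally likely sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))
mean_biased = sum(variance(s, ddof=0) for s in samples) / len(samples)
mean_unbiased = sum(variance(s, ddof=1) for s in samples) / len(samples)

print("population variance:      ", true_var)
print("average of divide-by-n:   ", mean_biased)
print("average of divide-by-n-1: ", mean_unbiased)
```

The divide-by-(n − 1) average matches the population variance exactly, while the divide-by-n average falls short by the factor (n − 1)/n.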

2. Where df enters Rankquant's four-step pipeline

Each estimator's df and the consequence when data is thin.
σ_u (reviewer personal SD)	df = n_u − 1. Bessel-corrected. Reviewer normalization requires σ_u > 0, which is automatic once a reviewer has any rating variance across their history.
Admission rule n_u ≥ 2	Minimum df = 1 for σ_u to be defined at all. A reviewer with n_u = 1 has no μ_u or σ_u; they go on file and are excluded from R1, R2, and R3. Reviewers with n_u ≥ 2 but σ_u = 0 are excluded from R1/R2 but included in R3 via imputed σ̃_s.
R1 aggregation df	df = N − 1 for the per-product mean z-score, where N is the count of qualifying reviewers of that product. Used implicitly in SE(R1) = 1/√N (scale factor 1 because z-scores have unit variance by construction).
R2 effective df (Kish)	N_eff = (Σ w_s)² / Σ w_s². When weights are equal N_eff = N; when one source dominates N_eff collapses toward 1. SE(R2) = 1/√N_eff.
R3 aggregation df	df = N' − 1 where N' = |Q_i ∪ constant-rater reviewers|. Imputed σ̃_s for constant raters introduces a small downward bias in SE(R3); we document this in the worked example below.
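
The admission rule and the Bessel-corrected σ_u in the table can be sketched as follows. This is a minimal illustration, not the production pipeline: the function names and status labels are hypothetical, and the exact-zero test for σ_u assumes ratings that repeat exactly (a real implementation might need a tolerance).

```python
from statistics import mean, stdev

def reviewer_stats(ratings):
    """Return (status, mu_u, sigma_u) for one reviewer's rating history."""
    n_u = len(ratings)
    if n_u < 2:
        # df would be 0: no personal distribution, so the reviewer goes on file.
        return ("on_file", None, None)
    mu_u = mean(ratings)
    sigma_u = stdev(ratings)  # Bessel-corrected, df = n_u - 1
    if sigma_u == 0:
        # Constant rater: no z-score for R1/R2; R3 would impute a source-level SD.
        return ("constant_rater", mu_u, 0.0)
    return ("normalized", mu_u, sigma_u)

def z_scores(ratings):
    """Per-reviewer z-scores, or None when the reviewer cannot be normalized."""
    status, mu_u, sigma_u = reviewer_stats(ratings)
    if status != "normalized":
        return None
    return [(x - mu_u) / sigma_u for x in ratings]

print(reviewer_stats([90]))          # single review
print(reviewer_stats([90, 90, 90]))  # constant rater
print(z_scores([85, 90, 95]))        # normalized reviewer
```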


3. Why reviewer-level normalization (not source-level)

The choice to normalize at the reviewer grain rather than the source grain is a df choice. A wine publication might have 40 staff critics over 20 years; pooling them into one "source distribution" would throw away the fact that each critic uses a different personal scale. Per-reviewer normalization gives us one μ_u and one σ_u per critic — finer-grained, more honest.

The cost: reviewers with few reviews have noisy μ_u and σ_u estimates. A critic with n_u = 2 has df = 1 on σ_u, which means their σ_u is estimated from essentially a single data point. We admit them anyway, but the z-scores they produce are noisy, and that noise propagates to the product-level aggregate, where it widens the SE and lowers the CI-floor. Thinness is penalized at the product level, not through exclusion.
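
To make the df = 1 case concrete: with exactly two distinct ratings, the Bessel-corrected σ_u equals |x₁ − x₂| / √2, so the two z-scores are always ±1/√2 (about ±0.707) whatever the ratings are. The sketch below checks this on a few hypothetical rating pairs; the function name is illustrative.

```python
from math import sqrt
from statistics import mean, stdev

def two_review_z(x1, x2):
    """z-scores for a reviewer whose whole history is two distinct ratings."""
    mu_u = mean([x1, x2])
    sigma_u = stdev([x1, x2])  # equals abs(x1 - x2) / sqrt(2) when n_u = 2
    return [(x1 - mu_u) / sigma_u, (x2 - mu_u) / sigma_u]

# Hypothetical rating pairs with very different gaps.
for pair in [(88, 92), (70, 99), (91, 90)]:
    print(pair, [round(z, 4) for z in two_review_z(*pair)])

print("1/sqrt(2) =", round(1 / sqrt(2), 4))
```

In other words, a two-review reviewer contributes only which of the two products they preferred, not by how much. That is one way to read "extremely noisy" at df = 1.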

4. Effective sample size under R2 source weighting

R2's source-weighted aggregation raises a classical survey-statistics question: what's the effective sample size of a non-uniformly-weighted sample? Kish's (1965) design-effect formula is the standard answer:

N_eff  =  ( Σ_u w_s(u) )²  /  Σ_u w_s(u)²

  where w_s(u) is the source-credibility weight for reviewer u's source.

Worked example — a wine with 12 reviewers split across three sources:

Source               w_s   count   contribution to Σw   contribution to Σw²
Wine Advocate         10      2         20                    200
Wine Spectator        10      2         20                    200
Vivino (crowd)         2      8         16                     32
                                       ─────                 ─────
                                       Σ w  = 56            Σ w² = 432

N_eff = 56² / 432  =  3136 / 432  ≈  7.26

(N_raw = 12; weighting collapses effective size by ~40%.)

So R2's CI-floor uses SE(R2) ≈ 1/√7.26 ≈ 0.371 for this product — not 1/√12 ≈ 0.289. The weighted aggregate rewards the professional sources' credibility but pays a variance-inflation cost that the CI-floor correctly books.
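
The worked example above can be reproduced in a few lines. This is a sketch that assumes one weight per reviewer, taken from the table; the function name `kish_n_eff` is illustrative rather than part of any published Rankquant code.

```python
from math import sqrt

def kish_n_eff(weights):
    """Kish effective sample size: (sum of weights)^2 / (sum of squared weights)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq

# One weight per reviewer: 2 Wine Advocate, 2 Wine Spectator, 8 Vivino.
weights = [10] * 2 + [10] * 2 + [2] * 8

n_raw = len(weights)
n_eff = kish_n_eff(weights)
se_r2 = 1 / sqrt(n_eff)

print(f"N_raw={n_raw}  N_eff={n_eff:.2f}  SE(R2)={se_r2:.3f}")
# prints: N_raw=12  N_eff=7.26  SE(R2)=0.371
```

With equal weights the same function returns N_raw, which is the R1 and R3 case.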

5. Why R3 broadening is a df trade

R3 admits constant-rater reviewers (n_u ≥ 2 but σ_u = 0) by imputing σ̃_s, the pooled SD of their source's variance-producing reviewers. This adds reviewers (more df on the product-level mean) at the cost of slightly biasing the imputed z-scores. The bias is bounded: imputed σ̃_s is always the source's typical variance, so the imputed z-score is the reviewer's rating expressed in source-typical units. Over large samples this converges to an unbiased estimator of the product's relative quality under the assumption that constant raters would, if they expressed opinions, use source-typical dispersion.

R3 only exists because the information is otherwise wasted. A reviewer who has rated 12 wines all at 90 points is not useless; they have clearly signalled something about those 12 wines. R3 extracts that signal; R1 and R2 throw it away. When R3 diverges meaningfully from R1, that's a finding in itself and our tagline reports it.
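
One way to compute a pooled SD for a source is sketched below. The text above does not give the pooling formula, so this assumes the standard df-weighted pooled variance, √(Σ (n_u − 1)·σ_u² / Σ (n_u − 1)), taken over the source's reviewers with nonzero variance. The function name and the sample histories are hypothetical. How the constant rater's rating is then centered is also not specified here, so the sketch stops at σ̃_s.

```python
from math import sqrt
from statistics import variance

def pooled_sd(histories):
    """df-weighted pooled SD over reviewers with nonzero rating variance.

    `histories` is a list of rating lists, one per reviewer of a single source.
    Returns None if no reviewer in the source has usable variance.
    """
    weighted_var_sum = 0.0
    df_sum = 0
    for ratings in histories:
        if len(ratings) < 2:
            continue  # single-review reviewer: no variance estimate
        var_u = variance(ratings)  # Bessel-corrected
        if var_u == 0:
            continue  # constant rater: not variance-producing
        df_u = len(ratings) - 1
        weighted_var_sum += df_u * var_u
        df_sum += df_u
    if df_sum == 0:
        return None
    return sqrt(weighted_var_sum / df_sum)

# Hypothetical source with four reviewers.
source_histories = [
    [85, 90, 95],  # variance 25, df 2
    [88, 92],      # variance 8, df 1
    [90, 90, 90],  # constant rater, skipped
    [87],          # single review, skipped
]
sigma_tilde_s = pooled_sd(source_histories)
print(round(sigma_tilde_s, 3))
```

Weighting by df means reviewers with longer histories dominate σ̃_s, which keeps two-review reviewers from driving the imputed scale.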

6. The t-vs-z question under per-reviewer normalization

A standard statistics instinct says: "for small N, use Student's t instead of z for the critical value." Under per-reviewer normalization, that instinct mostly doesn't apply. The z-scores z_u,i are already approximately unit-variance by construction. The sampling distribution of the mean z-score is asymptotically normal under the Central Limit Theorem, with small-N departures driven by (a) reviewer-σ estimation noise and (b) skew in the underlying raw-rating distribution.

We confirmed via simulation that for N ≥ 6 (the minimum R1 sample size we admit), the 90% CI coverage of Ẑ − 1.645·(1/√N) is within 1.5 percentage points of nominal on realistic review distributions. For N < 6 we flag the product as "limited coverage" and show the CI-floor with a visible warning.

SE(Ẑ)  =  1 / √N_eff

  N_eff = 4     →  SE ≈ 0.500    (limited-coverage flag shown)
  N_eff = 6     →  SE ≈ 0.408    (minimum admitted for R1 CI-floor)
  N_eff = 30    →  SE ≈ 0.183    (acceptable; coverage near nominal)
  N_eff = 100   →  SE ≈ 0.100    (comfortable)
  N_eff = 1000  →  SE ≈ 0.032    (floor essentially equals mean)
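
The table and the floor formula can be combined into a small helper. This is a sketch under stated assumptions: the mean z-score of 0.8 is a made-up input, the names `ci_floor`, `Z_90`, and `MIN_N` are illustrative, and the limited-coverage threshold is applied to N_eff here, as in the table above.

```python
from math import sqrt

Z_90 = 1.645  # critical value used in the text for the 90% CI-floor
MIN_N = 6     # below this, the product is flagged "limited coverage"

def ci_floor(mean_z, n_eff):
    """Return (floor, se, limited) for a product's mean z-score."""
    se = 1 / sqrt(n_eff)
    floor = mean_z - Z_90 * se
    limited = n_eff < MIN_N
    return floor, se, limited

mean_z = 0.8  # hypothetical product-level mean z-score
for n_eff in [4, 6, 30, 100, 1000]:
    floor, se, limited = ci_floor(mean_z, n_eff)
    flag = "  (limited coverage)" if limited else ""
    print(f"N_eff={n_eff:<5} SE={se:.3f}  floor={floor:+.3f}{flag}")
```

The same mean z-score yields a very different floor depending on N_eff, which is how thin coverage gets penalized without excluding the product.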
