Bayesian averaging is the right tool for review aggregation

By Ryan Siegal · Founder and Principal

Published 2026-04-23

The problem: raw averages lie when samples are thin

Raw averages are mathematically correct but operationally useless at small sample sizes. A product with three reviews, all 5-star, averages 5.0 — higher than an industry-leading product with 10,000 reviews averaging 4.7. The raw numbers say the tiny-sample product is better. Almost no human would agree.

This exact failure mode is what drives fake-review farms: buy 20 5-star reviews on a new listing and you can temporarily rank ahead of established competitors on raw average until organic reviews dilute the signal.

The Bayesian fix

Bayesian adjustment treats each product's true quality as an unknown parameter we're trying to estimate. Before seeing any reviews, our prior belief is that it's probably somewhere near the category average. Each new review updates that belief a little. Three reviews update our belief a little; 10,000 reviews update it dramatically.

The math is a weighted average between the product's raw mean and the category prior:

μ_P'  =  (μ_P · N_P  +  μ_C · k)  /  (N_P + k)

  μ_P   = product's raw mean
  N_P   = total number of reviews of the product
  μ_C   = category mean (the prior)
  k     = prior strength (number of "virtual" category-mean reviews mixed in)

When N_P is small relative to k, the prior dominates and the adjusted mean sits near the category average. When N_P is large, the prior's contribution becomes negligible and the adjusted mean approaches the raw mean.
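As a minimal sketch of the formula above, here is the opening example worked in Python. The function name `bayesian_mean` and the category mean of 4.2 are illustrative assumptions, not values taken from Rankquant's codebase.

```python
def bayesian_mean(raw_mean: float, n_reviews: int, category_mean: float, k: float = 30.0) -> float:
    """Shrink a product's raw mean toward the category mean with prior strength k."""
    if n_reviews < 0 or k < 0 or n_reviews + k == 0:
        raise ValueError("need non-negative n_reviews and k, with n_reviews + k > 0")
    return (raw_mean * n_reviews + category_mean * k) / (n_reviews + k)


CATEGORY_MEAN = 4.2  # assumed category prior, for illustration only

# Three 5-star reviews versus 10,000 reviews averaging 4.7.
newcomer = bayesian_mean(5.0, 3, CATEGORY_MEAN)
incumbent = bayesian_mean(4.7, 10_000, CATEGORY_MEAN)

print(f"newcomer:  {newcomer:.2f}")   # about 4.27
print(f"incumbent: {incumbent:.2f}")  # about 4.70
```

With these assumed numbers, the three-review product drops from 5.0 to roughly 4.27 while the established product barely moves, so the ranking now matches intuition.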

k = 30

Rankquant's default prior strength. At 30 reviews, a product's own mean and the category prior carry equal weight; beyond that, the product's own reviews increasingly dominate.

Rankquant methodology, version 1.0

2008

The year IMDb publicly documented its use of Bayesian-adjusted means (prior strength m ≈ 25,000 votes) to compute the Top 250 ranking.

IMDb Top 250 formula documentation

How IMDb does it

IMDb's Top 250 formula is the most famous deployed example of this technique in consumer software:

weighted rating = (v / (v+m)) · R  +  (m / (v+m)) · C

  R = average rating of the film
  v = number of votes for the film
  m = minimum votes required to be listed (≈25,000 historically)
  C = mean vote across the whole Top 250 pool

Every film is pulled toward the pool-wide mean C, and the pull is strongest when v is small relative to m. This stops a brand-new film with 10 passionate 10/10 votes from dominating the list. The formula is the same Bayesian mean as above, with R = μ_P, v = N_P, C = μ_C, and k = m.
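The equivalence between the two formulas is easy to check numerically. The sketch below uses hypothetical function names and an assumed pool mean of 7.0; only m = 25,000 comes from the text above.

```python
import math


def imdb_weighted_rating(R: float, v: int, m: float, C: float) -> float:
    """Weighted rating in the form quoted above."""
    return (v / (v + m)) * R + (m / (v + m)) * C


def bayesian_mean(mu_p: float, n_p: int, mu_c: float, k: float) -> float:
    """The article's formula: raw mean blended with k virtual prior reviews."""
    return (mu_p * n_p + mu_c * k) / (n_p + k)


M = 25_000       # minimum-votes parameter from the text
POOL_MEAN = 7.0  # assumed value of C, for illustration only

# A new film with 10 votes, all 10/10.
wr = imdb_weighted_rating(R=10.0, v=10, m=M, C=POOL_MEAN)
bm = bayesian_mean(mu_p=10.0, n_p=10, mu_c=POOL_MEAN, k=M)

print(f"weighted rating: {wr:.4f}")
print(f"same formula?    {math.isclose(wr, bm)}")
```

Under these assumptions the ten perfect votes move the film only about a thousandth of a point above the pool mean.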

The true Bayesian estimate ensures that films with too few votes don't make it onto the Top 250 on the basis of unreliable data.
— IMDb Top 250 formula (public, 2008)

Why k matters and how we pick it

The prior strength k is a knob: larger values make the system more conservative (new products have to earn their position with more reviews); smaller values let high-quality-but-new products rise faster but also let fake-review farms rise faster.
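To make the knob concrete, here is a small sketch that applies three values of k to the fake-review scenario from earlier (20 purchased 5-star reviews). The category mean of 4.2 and the candidate k values are assumptions chosen for illustration.

```python
def bayesian_mean(raw_mean: float, n_reviews: int, category_mean: float, k: float) -> float:
    """Raw mean blended with k virtual reviews at the category mean."""
    return (raw_mean * n_reviews + category_mean * k) / (n_reviews + k)


CATEGORY_MEAN = 4.2  # assumed category prior
FARM_REVIEWS = 20    # purchased 5-star reviews, as in the earlier example

# Adjusted mean of the farmed listing under a weak, default, and strong prior.
adjusted_by_k = {
    k: bayesian_mean(5.0, FARM_REVIEWS, CATEGORY_MEAN, k)
    for k in (5, 30, 100)
}

for k, adjusted in adjusted_by_k.items():
    print(f"k={k:>3}: adjusted mean = {adjusted:.2f}")
# k=  5: adjusted mean = 4.84
# k= 30: adjusted mean = 4.52
# k=100: adjusted mean = 4.33
```

The same 20 reviews buy much less ranking lift as k grows, which is exactly the cost a genuinely excellent new product also pays.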

Rankquant uses k = 30 as the default across categories because review aggregation empirically stabilizes around 30 reviews for most consumer products: that's where the sampling variance of the mean drops low enough that you're estimating a real thing, not noise. Categories with higher natural review volumes (movies, books, Amazon electronics) could use larger k; categories with thinner review data (boutique wines, independent films) benefit from the default.
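For a rough sense of the sampling-variance argument, the standard error of a mean is σ / √n. The sketch below assumes a per-review standard deviation of 1.0 star, which is an illustrative guess rather than a measured figure.

```python
import math

ASSUMED_SIGMA = 1.0  # assumed spread of individual star ratings, in stars


def standard_error(sigma: float, n: int) -> float:
    """Standard error of the mean of n independent ratings with spread sigma."""
    return sigma / math.sqrt(n)


for n in (3, 30, 300):
    print(f"n={n:>3}: standard error = {standard_error(ASSUMED_SIGMA, n):.2f} stars")
```

Under that assumption, the uncertainty in a raw mean falls from over half a star at 3 reviews to under a fifth of a star at 30.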

Any change to k triggers a public version bump of the methodology. Historical scores remain available at the k they were computed under.

Bayesian averaging doesn't solve rating inflation on its own

Important caveat: Bayesian adjustment fixes the thin-sample manipulation problem but does NOT fix the broader rating-inflation problem. If the category mean itself is inflated (almost everything is 4.2+), the adjusted mean is also inflated — just more stable.

That's why Rankquant pairs Bayesian adjustment with within-category z-scoring and percentile remapping. Bayesian handles thin samples; z-scoring handles inflation; percentile mapping puts the output back on a usable 1-5 scale. All three steps are required for a score that carries real information. Full method documented at /methodology.
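The three steps can be sketched end to end as follows. This is one plausible reading of the pipeline, not the documented Rankquant implementation: the review-weighted category mean, the normal-CDF percentile mapping, and all function and product names are assumptions.

```python
from statistics import NormalDist, mean, pstdev


def bayesian_mean(raw_mean: float, n_reviews: int, category_mean: float, k: float) -> float:
    """Step 1: shrink the raw mean toward the category mean."""
    return (raw_mean * n_reviews + category_mean * k) / (n_reviews + k)


def score_category(products: dict[str, tuple[float, int]], k: float = 30.0) -> dict[str, float]:
    """Map {name: (raw_mean, n_reviews)} to scores on a 1-5 scale."""
    # Assumed prior: the review-weighted mean of the whole category.
    total = sum(n for _, n in products.values())
    category_mean = sum(m * n for m, n in products.values()) / total

    adjusted = {name: bayesian_mean(m, n, category_mean, k)
                for name, (m, n) in products.items()}

    # Step 2: z-score the adjusted means within the category.
    values = list(adjusted.values())
    mu, sigma = mean(values), pstdev(values)
    z = {name: (a - mu) / sigma if sigma > 0 else 0.0
         for name, a in adjusted.items()}

    # Step 3 (assumed mapping): percentile via the standard normal CDF,
    # stretched linearly onto the 1-5 range.
    normal = NormalDist()
    return {name: 1.0 + 4.0 * normal.cdf(score) for name, score in z.items()}


# Hypothetical category in which every raw mean is 4.1 or higher.
scores = score_category({
    "A": (4.9, 5),
    "B": (4.6, 800),
    "C": (4.3, 1200),
    "D": (4.1, 400),
})
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.2f}")
```

In this made-up category the raw means are bunched between 4.1 and 4.9, yet the output spreads across the scale, and the five-review product "A" no longer ranks first despite its 4.9 raw mean.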

Frequently asked questions

Is Bayesian averaging the same as a "weighted average"?

Technically yes — it's a weighted average of the raw product mean and the category prior, with weights proportional to sample size. The Bayesian framing is useful because it makes explicit that we're estimating an unknown parameter (true quality), which clarifies when to use stronger vs. weaker priors.
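One standard way to see where the weights come from is the normal-normal conjugate model. The sketch below assumes individual ratings are normally distributed around the product's true quality θ with variance σ², and that the prior on θ is normal around the category mean with variance τ². Both are modeling assumptions introduced here for illustration; the article's formula does not depend on them.

```latex
\begin{aligned}
x_i \mid \theta &\sim \mathcal{N}(\theta, \sigma^2), \qquad i = 1, \dots, N_P, \\
\theta &\sim \mathcal{N}(\mu_C, \tau^2), \\[4pt]
\mathbb{E}[\theta \mid x_1, \dots, x_{N_P}]
  &= \frac{\dfrac{N_P}{\sigma^2}\,\mu_P + \dfrac{1}{\tau^2}\,\mu_C}
          {\dfrac{N_P}{\sigma^2} + \dfrac{1}{\tau^2}}
   = \frac{\mu_P \cdot N_P + \mu_C \cdot k}{N_P + k},
  \qquad k = \frac{\sigma^2}{\tau^2}.
\end{aligned}
```

Under this model, k is the ratio of within-product rating noise to between-product quality spread: noisy ratings or a tightly clustered category call for a stronger prior, while clean ratings or a widely spread category call for a weaker one.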

Does k = 30 favor popular products over niche ones?

Slightly, yes. A niche product with 10 carefully considered reviews will be pulled toward the category mean more than a popular product with 1,000 less rigorous reviews. The trade-off is accepting some under-ranking of high-signal small-sample products in exchange for defending against low-signal small-sample manipulation. The value of k is adjustable in the open-source repo.

Why not just display a confidence interval around the average?

We do, on every review page (as σ-distance). But confidence intervals are bad UX for quick decisions — consumers need a single ranked number. Bayesian-adjusted means are the right aggregate scalar; confidence bands are shown alongside for users who want the uncertainty information.

What's the difference between this and a "credibility-weighted" average?

They're closely related. Credibility theory (used in insurance pricing) is essentially applied Bayesian inference — the "credibility factor" is the N_P / (N_P + k) term in our formula. Same math, different vocabulary.
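A short sketch makes the correspondence explicit. The function names are hypothetical and the category mean of 4.2 is an assumed value.

```python
import math


def credibility_factor(n_reviews: int, k: float) -> float:
    """Z = N_P / (N_P + k): the share of weight given to the product's own data."""
    return n_reviews / (n_reviews + k)


def credibility_weighted(raw_mean: float, n_reviews: int, category_mean: float, k: float = 30.0) -> float:
    """Credibility form: Z * own experience + (1 - Z) * category prior."""
    z = credibility_factor(n_reviews, k)
    return z * raw_mean + (1.0 - z) * category_mean


def bayesian_mean(raw_mean: float, n_reviews: int, category_mean: float, k: float = 30.0) -> float:
    """The article's formula, for comparison."""
    return (raw_mean * n_reviews + category_mean * k) / (n_reviews + k)


CATEGORY_MEAN = 4.2  # assumed category prior

for n in (3, 30, 10_000):
    z = credibility_factor(n, 30.0)
    same = math.isclose(
        credibility_weighted(4.7, n, CATEGORY_MEAN),
        bayesian_mean(4.7, n, CATEGORY_MEAN),
    )
    print(f"N_P={n:>6}: Z = {z:.3f}, matches Bayesian mean: {same}")
```

At N_P = 30 the credibility factor is exactly 0.5, which is another way of reading k: the number of reviews at which a product's own data and the prior are trusted equally.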