Bayesian averaging is the right tool for review aggregation
By Ryan Siegal · Founder and Principal
The problem: raw averages lie when samples are thin
Raw averages are mathematically correct but operationally useless at small sample sizes. A product with three reviews, all 5-star, averages 5.0 — higher than an industry-leading product with 10,000 reviews averaging 4.7. The raw numbers say the tiny-sample product is better. Almost no human would agree.
This exact failure mode is what drives fake-review farms: buy 20 five-star reviews on a new listing and you can temporarily rank ahead of established competitors on raw average until organic reviews dilute the signal.
The Bayesian fix
Bayesian adjustment treats each product's true quality as an unknown parameter we're trying to estimate. Before seeing any reviews, our prior belief is that it's probably somewhere near the category average. Each new review shifts that belief toward the product's own data: three reviews barely move it, while 10,000 reviews all but replace the prior.
The math is a weighted average between the product's raw mean and the category prior:
μ_P' = (μ_P · N_P + μ_C · k) / (N_P + k)
μ_P = product's raw mean
N_P = total number of reviews of the product
μ_C = category mean (the prior)
k = prior strength (number of "virtual" category-mean reviews mixed in)
When N_P is small relative to k, the prior dominates and the adjusted mean sits near the category average. When N_P is large, the prior's contribution becomes negligible and the adjusted mean approaches the raw mean.
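As a minimal sketch of the formula above, the following applies it to the two products from the opening example. The function name `bayesian_mean` and the category mean of 4.2 are illustrative assumptions, not values taken from the Rankquant methodology; k = 30 is the default discussed later in this article.

```python
def bayesian_mean(raw_mean: float, n_reviews: int,
                  category_mean: float, k: float = 30.0) -> float:
    """Shrink a product's raw mean toward the category mean.

    Implements mu_P' = (mu_P * N_P + mu_C * k) / (N_P + k).
    """
    if n_reviews < 0 or k < 0:
        raise ValueError("n_reviews and k must be non-negative")
    if n_reviews + k == 0:
        raise ValueError("need at least one real or virtual review")
    return (raw_mean * n_reviews + category_mean * k) / (n_reviews + k)


# Hypothetical category mean, chosen only for illustration.
CATEGORY_MEAN = 4.2

new_listing = bayesian_mean(5.0, 3, CATEGORY_MEAN)        # three 5-star reviews
established = bayesian_mean(4.7, 10_000, CATEGORY_MEAN)   # 10,000 reviews at 4.7

print(f"new listing: {new_listing:.2f}")
print(f"established: {established:.2f}")
```

Under these assumptions the three-review listing drops from 5.0 to about 4.27, while the established product barely moves from 4.7, so the ordering matches intuition again.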
k = 30: Rankquant's default prior strength. At 30 reviews a product's own data and the category prior carry equal weight; beyond that, the product's own reviews increasingly dominate.
Rankquant methodology, version 1.0
The year IMDb publicly documented its use of Bayesian-adjusted means (prior strength m ≈ 25,000 votes) to compute the Top 250 ranking.
IMDb Top 250 formula documentation
How IMDb does it
IMDb's Top 250 formula is the most famous deployed example of this technique in consumer software:
weighted rating = (v / (v+m)) · R + (m / (v+m)) · C
R = average rating of the film
v = number of votes for the film
m = minimum votes required to be listed (≈25,000 historically)
C = mean vote across the whole Top 250 pool
Every film is pulled toward the pool-wide mean C, and the pull is strongest for films whose vote count v is small relative to m. This stops a brand-new film with 10 passionate 10/10 votes from dominating the list. The formula is exactly the Bayesian mean above with μ_P = R, N_P = v, μ_C = C, and k = m.
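The equivalence between the two forms can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the pool mean C = 7.0 and the sample films are made-up inputs, not IMDb data.

```python
import math


def imdb_weighted_rating(R: float, v: int, m: float, C: float) -> float:
    """IMDb-style form: (v / (v + m)) * R + (m / (v + m)) * C."""
    return (v / (v + m)) * R + (m / (v + m)) * C


def bayesian_mean(raw_mean: float, n: int, prior_mean: float, k: float) -> float:
    """Generic form: (raw_mean * n + prior_mean * k) / (n + k)."""
    return (raw_mean * n + prior_mean * k) / (n + k)


M = 25_000   # prior strength quoted in the article
C = 7.0      # hypothetical pool-wide mean

# (average rating, vote count) pairs, invented for illustration.
films = [(10.0, 10), (8.5, 25_000), (9.2, 2_000_000)]

for R, v in films:
    weighted = imdb_weighted_rating(R, v, M, C)
    # Both expressions are the same weighted average written two ways.
    assert math.isclose(weighted, bayesian_mean(R, v, C, M))
    print(f"R={R}, v={v}: weighted rating {weighted:.3f}")
```

With these inputs, the film with 10 perfect votes lands almost exactly on C, and the film with v = m lands halfway between its own average and C.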
The true Bayesian estimate ensures that films with too few votes don't make it onto the Top 250 on the basis of unreliable data.
Why k matters and how we pick it
The prior strength k is a knob: larger values make the system more conservative (new products have to earn their position with more reviews); smaller values let high-quality-but-new products rise faster but also let fake-review farms rise faster.
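To make the trade-off concrete, here is a small sketch of how a burst of 20 purchased 5-star reviews fares under different prior strengths. The category mean of 4.2 and the candidate k values are assumptions picked for illustration.

```python
def bayesian_mean(raw_mean: float, n: int, category_mean: float, k: float) -> float:
    """Weighted average of the raw mean and the category prior."""
    return (raw_mean * n + category_mean * k) / (n + k)


CATEGORY_MEAN = 4.2    # hypothetical category mean
FAKE_REVIEWS = 20      # all 5-star, as in the fake-review-farm scenario

# Adjusted score of the manipulated listing for each prior strength.
adjusted_by_k = {
    k: bayesian_mean(5.0, FAKE_REVIEWS, CATEGORY_MEAN, k)
    for k in (5, 30, 100)
}

for k, score in adjusted_by_k.items():
    print(f"k={k:>3}: adjusted mean {score:.2f}")
```

With these numbers the manipulated listing scores about 4.84 at k = 5, 4.52 at k = 30, and 4.33 at k = 100. A genuinely excellent new product with 20 honest reviews would be held back by exactly the same amounts, which is the cost of a conservative k.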
Rankquant uses k = 30 as the default across categories because review aggregation empirically stabilizes around 30 reviews for most consumer products: that's where the sampling variance of the mean drops low enough that you're estimating a real thing, not noise. Categories with higher natural review volumes (movies, books, Amazon electronics) could use larger k; categories with thinner review data (boutique wines, independent films) benefit from the default.
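The sampling-variance argument can be illustrated with the standard error of a mean, σ / √n. The sketch below assumes a per-review standard deviation of 1.0 star, which is a made-up figure for illustration and not a measured value; it shows the shape of the curve rather than justifying 30 specifically.

```python
import math


def standard_error(review_std: float, n_reviews: int) -> float:
    """Standard error of the mean of n independent reviews."""
    if n_reviews <= 0:
        raise ValueError("n_reviews must be positive")
    return review_std / math.sqrt(n_reviews)


REVIEW_STD = 1.0   # hypothetical spread of individual star ratings

for n in (3, 10, 30, 100, 1000):
    print(f"n={n:>4}: standard error ~ {standard_error(REVIEW_STD, n):.3f} stars")
```

Under this assumption the uncertainty in the raw mean is roughly 0.58 stars at 3 reviews and roughly 0.18 stars at 30, so gains from each additional review shrink quickly past that point.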
Any change to k triggers a public version bump of the methodology. Historical scores remain available at the k they were computed under.
Bayesian averaging doesn't solve rating inflation on its own
Important caveat: Bayesian adjustment fixes the thin-sample manipulation problem but does NOT fix the broader rating-inflation problem. If the category mean itself is inflated (almost everything is 4.2+), the adjusted mean is also inflated — just more stable.
That's why Rankquant pairs Bayesian adjustment with within-category z-scoring and percentile remapping. Bayesian handles thin samples; z-scoring handles inflation; percentile mapping puts the output back on a usable 1-5 scale. All three steps are required for a score that carries real information. Full method documented at /methodology.
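The three steps could be chained roughly as follows. This is a loose sketch of one possible reading, not the documented Rankquant method: the function names, the review-weighted category mean, the mid-rank percentile rule, and the linear mapping onto 1-5 are all assumptions, and the product data is invented.

```python
from statistics import mean, pstdev


def bayesian_mean(raw_mean, n, category_mean, k=30):
    """Step 1: shrink thin samples toward the category mean."""
    return (raw_mean * n + category_mean * k) / (n + k)


def score_category(products, k=30):
    """products: list of (name, raw_mean, n_reviews) within one category.

    Returns {name: score on a 1-5 scale}. Illustrative only.
    """
    total = sum(n for _, _, n in products)
    # Assumed prior: review-weighted mean of the category.
    category_mean = sum(m * n for _, m, n in products) / total

    adjusted = {name: bayesian_mean(m, n, category_mean, k)
                for name, m, n in products}

    # Step 2: z-score within the category to remove the inflated baseline.
    mu, sigma = mean(adjusted.values()), pstdev(adjusted.values())
    z = {name: (a - mu) / sigma if sigma > 0 else 0.0
         for name, a in adjusted.items()}

    # Step 3: mid-rank percentile, then a linear map onto 1-5.
    values = list(z.values())

    def percentile(value):
        below = sum(1 for other in values if other < value)
        equal = sum(1 for other in values if other == value)
        return (below + 0.5 * equal) / len(values)

    return {name: 1 + 4 * percentile(val) for name, val in z.items()}


# Invented products: A has a high raw mean but only five reviews.
catalog = [("A", 4.9, 5), ("B", 4.6, 800), ("C", 4.4, 1200), ("D", 4.2, 300)]
scores = score_category(catalog)
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.2f}")
```

In this toy catalog, product A leads on raw average but falls behind B once its five reviews are shrunk toward the category mean, and the final scores spread across the 1-5 range even though every raw mean is 4.2 or higher.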
Frequently asked questions
Is Bayesian averaging the same as a "weighted average"?
Does k = 30 favor popular products over niche ones?
Why not just display a confidence interval around the average?
What's the difference between this and a "credibility-weighted" average?