Building better AI benchmarks: How many raters are enough?
Researchers at Google propose that 3-5 human raters are sufficient for reliable AI benchmark evaluation, based on an analysis of 11 existing benchmarks. The study finds that adding raters improves consistency, but with diminishing returns, so 3-5 raters offer a good tradeoff between cost and reliability. For builders evaluating and fine-tuning AI models, this suggests that a smaller, more focused set of raters can be used without sacrificing accuracy.
- 3-5 human raters are sufficient for reliable AI benchmark evaluation.
- More raters improve consistency but raise costs.
- Diminishing returns apply to rater count.
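
The summary does not describe how the study measured reliability, so the following is only a rough illustration of why returns diminish as raters are added. It assumes the Spearman-Brown formula from classical test theory, which estimates the reliability of an average over k raters from the reliability of a single rater. The single-rater reliability of 0.5 is a hypothetical value chosen for the example, not a figure from the study, and the function names are likewise illustrative.

```python
def spearman_brown(single_rater_reliability: float, n_raters: int) -> float:
    """Estimated reliability of the mean rating across n_raters.

    Assumption: raters are interchangeable and each has the same
    single-rater reliability (classical test theory, not the study's method).
    """
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)


def marginal_gains(single_rater_reliability: float, max_raters: int) -> list[float]:
    """Reliability gained by adding the k-th rater, for k = 1..max_raters."""
    gains = []
    previous = 0.0
    for k in range(1, max_raters + 1):
        current = spearman_brown(single_rater_reliability, k)
        gains.append(current - previous)
        previous = current
    return gains


# Hypothetical single-rater reliability, chosen only for illustration.
ASSUMED_RELIABILITY = 0.5

for k, gain in enumerate(marginal_gains(ASSUMED_RELIABILITY, 10), start=1):
    reliability = spearman_brown(ASSUMED_RELIABILITY, k)
    print(f"{k:2d} raters: reliability {reliability:.3f} (gain {gain:+.3f})")
```

Under these assumptions, each added rater contributes less than the one before, while cost typically grows with every rater added. That is the shape of the tradeoff the bullets above describe, though the actual numbers depend on the benchmark and the raters.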