Every time one of our agents made a bad call in production, we had the same problem. Logs showed what happened. Nothing showed why it decided to do it.
Turns out the gap is decision context. Tools capture inputs and outputs. Nobody captures what the agent believed when it made the call, what policy it was running, what it was weighing.
We ended up building something to capture exactly that, full context at every agent decision, root cause in 30 seconds instead of hours.
How are you all handling this right now? Curious what setups people are using in production.
[link] [comments]