how do you actually catch your agent breaking in prod before users do? [D]

we run an agent-based product in production and use langfuse for traces.

last month our agent started refusing requests it should have answered. took us almost a week to notice. evals were all green. traces looked normal because each call by itself was fine. we found out from support tickets piling up.

now i'm looking at our setup and i'm like, what does this stack actually do when things go bad? answer: nothing. it just records stuff. someone has to notice, dig through traces, write a new eval, push a fix. all manual.
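to make the failure mode concrete: each individual trace passed evals, but the aggregate refusal rate had quietly tripled. the kind of check i'm imagining is a rolling-window monitor over production traffic. everything below is a made-up sketch, not langfuse's actual schema or API — assume some cheap heuristic or classifier already sets a `refused` flag per call:

```python
from collections import deque

class RefusalRateMonitor:
    """alert when the recent refusal rate drifts well above baseline.

    catches the 'each call looks fine, the aggregate is wrong' failure
    that per-trace evals miss. window/baseline/factor are all knobs
    you'd tune against your own traffic.
    """

    def __init__(self, window: int = 500, baseline: float = 0.05, factor: float = 3.0):
        self.window = deque(maxlen=window)  # last N calls, True = refused
        self.baseline = baseline            # expected refusal rate
        self.factor = factor                # alert at factor * baseline

    def record(self, refused: bool) -> bool:
        """record one call; return True if we should page someone."""
        self.window.append(refused)
        if len(self.window) < self.window.maxlen:
            return False                    # not enough data to judge yet
        rate = sum(self.window) / len(self.window)
        return rate > self.baseline * self.factor
```

nothing fancy, but it would have turned our week of silence into an alert within a few hundred calls.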

so i wanted to ask:

  1. when your agent quietly starts doing the wrong thing, how do you find out? alerts? users yelling?
  2. does anything in your stack actually take action when quality drops, or do you also just page a human?
  3. for people running more than a million calls a day, are you tracing everything or sampling? if sampling, how do you not miss weird edge cases?
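on question 3, the approach i've been sketching is tail-biased sampling: sample the boring traffic, always keep the weird stuff. the field names and thresholds below are hypothetical — in practice "weird" would be errors, refusals, latency outliers, maybe low similarity to known-good answers:

```python
import random

def should_trace(call: dict, sample_rate: float = 0.01) -> bool:
    """keep 100% of anomalous calls, a uniform 1% of the rest.

    `call` is a made-up dict shape, not any real tracing SDK's —
    the point is just: never let sampling drop the edge cases.
    """
    if call.get("error") or call.get("refused"):
        return True                        # never drop failures
    if call.get("latency_ms", 0) > 10_000:
        return True                        # never drop latency outliers
    return random.random() < sample_rate   # sample the healthy traffic
```

curious whether people doing >1M calls/day actually do something like this or just eat the storage cost and trace everything.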

i keep seeing names like raindrop that claim they auto-generate evals from prod traffic. anyone actually using these in real production? do they work or is it marketing?

not looking for a list of tools. just want to know what actually works for you and what doesn't.

submitted by /u/BriefCardiologist656
