Co-authored by Rodrigo Araujo, Mansura H., and Michal Ulewicz
Debugging a multi-agent system by inspecting every single trace seems like an impossible task. So many different things can go wrong: wrong tool calls, bad routing decisions, summarizing agents that leave important information out.
The irony is that we’re debugging agentic systems, and while we’re not fully relying on them, they can still play a role in helping debug themselves.
Traces are just logs, and it takes a lot of mental energy to go from outputs to the agent or tool that caused the poor answer, and then to explain why it happened. As humans, though, we are very good at imagining scenarios: replaying what would have happened if the router had called the right agent instead of the wrong one. That is one way to find the "why"; we are good at it, but it takes energy.
I come from the Machine Learning world, where most things are grounded in association — observing patterns and correlations. So every time we made a prediction, we couldn’t really explain the “why” behind it. Instead, we’d say things like “this variable is highly influential” or “this feature is strongly associated with the outcome.”
However, Judea Pearl changed that. With frameworks like Structural Causal Models and do-calculus, we can actually use the word “because” when answering “why” questions.

When using Pearl’s framework, your explanation becomes:
“Variable X is a direct cause of the outcome. Based on the causal model, intervening on X would change the result by x amount.”
Traces tell you what happened, not why it failed or what caused the problem.
What If We Treated Failures as a Causal Problem?
We were always told that correlation is not causation. However, there are ways to make causal claims — at least within the context of a single execution episode (we are not claiming population-level causal effects).
“Given that this is what happened, would the failure still have occurred if agent X had produced a different output?”
This type of question is what we call a counterfactual in the language of causal inference. You can read more about it in The Book of Why or in the broader causal inference literature.
Here’s a quick intuition. Judea Pearl introduced a conceptual framework called the Ladder of Causation to describe different levels of reasoning about cause and effect.
- At the base, we have seeing (association), where we simply observe patterns and correlations.
- The second level is doing (intervention), where we ask what happens if we actively change something — like in controlled experiments.
- The highest level is imagining (counterfactuals), where we consider alternate realities.
It’s this final level that allows us to truly reason about cause and effect — the “why” behind outcomes. With this proposed engine, our goal is to operate at this level. In the next section, we’ll walk through how that works in practice.

Our engine distinguishes causation from correlation by:
- Extracting multi-modal signals from user feedback, behavior, and execution patterns
- Computing preliminary attribution through graph-based analysis
- Validating with counterfactual replay: “If we changed this component, would the failure be prevented?”
- Fusing evidence from multiple sources into a final diagnosis
Let’s dig a bit deeper into each of those components.
The Mental Model: Episode Graph
The foundation of the analysis is the construction of an episode graph. Instead of treating execution as a list of steps, we model it as a graph.

An episode graph represents a single execution trace as a structured network:
Nodes = components in the system
- agents
- tools
- intermediate reasoning steps
Edges = information flow
- who influenced whom
- which output was consumed downstream
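To make this concrete, here is a minimal sketch of such a graph as a plain Python data structure. This is an illustrative model, not the engine's actual schema:
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    role: str         # e.g. "router", "tool", "synthesizer"
    output: str = ""  # what this component produced

@dataclass
class EpisodeGraph:
    nodes: dict = field(default_factory=dict)  # node_id -> Node
    edges: list = field(default_factory=list)  # (source, target, relation)

    def add_node(self, node_id, role, output=""):
        self.nodes[node_id] = Node(node_id, role, output)

    def add_edge(self, source, target, relation):
        # relation captures the information flow, e.g. "produces" or "uses"
        self.edges.append((source, target, relation))

    def predecessors(self, node_id):
        # nodes whose output flowed into node_id, with the edge relation
        return [(s, r) for (s, t, r) in self.edges if t == node_id]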
A Simple Example
Consider a failure scenario:
- The router selects a generic agent
- A specialist agent is never used
- The final response lacks critical domain knowledge
In a linear trace, this looks like a normal flow.
In an episode graph, something becomes immediately visible:
- The specialist node exists but is not used
- The router decision becomes a critical branching point
- The final response depends only on weak upstream signals
👉 The failure is no longer just “a bad answer”
👉 It becomes a structural issue in the graph
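Using the toy EpisodeGraph sketched above, this scenario can be encoded and the structural issue surfaced directly (the node names are invented for illustration):
graph = EpisodeGraph()
graph.add_node("router", role="router")
graph.add_node("generic_agent", role="generic_agent")
graph.add_node("specialist_agent", role="specialist_agent")  # exists, never used
graph.add_node("final_response", role="synthesizer")

graph.add_edge("router", "generic_agent", "provided_to")
graph.add_edge("generic_agent", "final_response", "produces")

# Structural check: which components never contributed to the final response?
contributors = {source for (source, target, relation) in graph.edges}
unused = [n for n in graph.nodes if n not in contributors and n != "final_response"]
print(unused)  # ['specialist_agent']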
With an episode graph in place, we can enable the supporting processes that help us reach the root cause of a failure, if there is one.
Multimodal Signal Extraction
There are many signals we can use to feed our attribution engine later on. We can divide them into three types:
Explicit feedback
The first and most obvious type is explicit feedback from users. The most important signals are:
- Dissatisfaction scores from user ratings (e.g., thumbs up/down or a 5-star rating)
- Comments from user feedback
Behavioural Signals
These are inferred from user interaction patterns: a way to detect dissatisfaction or confusion without explicit feedback. This is crucial because users often don’t provide explicit ratings, but their behaviour reveals problems. Some examples are:
- Quick retry (≤ 60 seconds between response and next user input)
- Rephrased question (lexical similarity to the previous question)
- Human escalation
- Follow-up confusion
User: "How do I reset my password?"
Agent: "You can manage your account settings in the profile section."
User: "What are the steps to change my password?" ← Rephrase detected (75% similar)
User: "How long does shipping take?"
Agent: "We offer various shipping options with different speeds."
User: "That's not what I asked. How many days for standard shipping?" ← Confusion detected
Outcome and Semantic Signals
These combine signals that analyze failure patterns and check task completion with LLM-based semantic analysis that detects hallucination, poor reasoning quality, and so on.
Combined, all of these signals tell us the level of user dissatisfaction and give us an initial list of flags to analyze later. This is all in preparation for the Attribution Engine.
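As an illustrative sketch, all of these signals could be bundled into one record for the attribution engine to consume; the structure and field names below are assumptions, not the engine's real schema:
from dataclasses import dataclass
from typing import Optional

@dataclass
class EpisodeSignals:
    # Explicit feedback
    user_rating: Optional[float] = None  # e.g. 1-5 stars; None if absent
    feedback_comment: str = ""
    # Behavioural signals
    quick_retry: bool = False
    rephrased: bool = False
    escalated: bool = False
    # Outcome and semantic signals
    task_completed: bool = True
    semantic_flags: tuple = ()  # e.g. ("hallucination", "missing_evidence")

signals = EpisodeSignals(quick_retry=True, rephrased=True,
                         task_completed=False,
                         semantic_flags=("missing_evidence",))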
So far, we have collected and combined all these signals. So what’s next?
Computing Initial Priors (Who’s Usually Responsible?)
Not all components are equally likely to cause failures. We start with low priors for who could be the culprit:
BASE_ROLE_PRIORS = {
    "synthesizer": 0.05,       # Often responsible (combines info)
    "router": 0.05,            # Can misdirect tasks
    "generic_agent": 0.04,     # General-purpose agents
    "specialist_agent": 0.04,  # Domain-specific agents
    "tool": 0.03,              # External tools
    "agent": 0.03,             # Generic agents
    "other": 0.02,             # Unknown components
}
We then dynamically adjust those priors based on the signals we collect.
# If the task failed
if has_failure:
    synthesizer_score += 0.08  # Synthesizer gets +8%
    router_score += 0.03       # Router gets +3%

# If the user is dissatisfied
if has_low_satisfaction:
    synthesizer_score += 0.05  # Synthesizer gets +5%

# If the user requested escalation
if has_escalation:
    synthesizer_score += 0.03  # Synthesizer gets +3%
    router_score += 0.02       # Router gets +2%

# If a tool explicitly failed
if tool_status == "error":
    tool_score += 0.30  # Tool gets +30% (clear failure!)
An Example of Bootstrapping Initial Priors
Why do synthesizers get higher priors? They’re the final step before the user sees the response. If something’s wrong with the output, the synthesizer often bears responsibility for not catching it.
But priors are only the starting point. They tell us which components look more suspicious in general based on their role and the signals observed in the episode. They do not yet explain how responsibility moves through the chain of execution.
To do that, we need to follow the dependencies between components. A bad final answer may not have originated in the synthesizer itself — it may have been inherited from a router decision, a tool failure, or an earlier agent output. That is where backward propagation comes in.
Backward Propagation (Following the Trail)
Here’s where it gets interesting. We propagate responsibility backward through the graph from the final response.
The Intuition
If the final response is bad, and it came from the synthesizer, and the synthesizer used output from a tool, then the tool shares some responsibility.
But how much? That depends on:
- Edge weight — How strong is the connection?
- Role multiplier — How important is this component?
- Distance decay — How far away is it?
- Propagation strength — How much responsibility flows backward?
The Formula
contribution = (
    current_node_score
    * edge_weight
    * role_multiplier
    * distance_decay
    * propagation_strength
)
source_node_score += contribution
I’ll describe some of those components in detail and give the intuition behind the others.
Edge Weights
Different relationships carry different weights:
EDGE_WEIGHTS = {
    "produces": 1.00,        # Direct output (strongest)
    "uses": 0.90,            # Direct usage
    "provided_to": 0.45,     # Indirect provision
    "parent_child": 0.25,    # Hierarchical
    "input_to_trace": 0.20,  # Initial input
    "evaluated_by": 0.00,    # No causal flow
}
Why is “produces” 1.00? If Agent A produces the output that becomes the final response, and the final response is bad, Agent A is directly responsible.
Distance Decay
Responsibility decreases with distance:
DISTANCE_DECAY = {
    0: 1.00,  # Direct connection (100%)
    1: 1.00,  # One hop away (100%)
    2: 0.82,  # Two hops away (82%)
    3: 0.68,  # Three hops away (68%)
    4: 0.55,  # Four hops away (55%)
}
Why decay? Components far from the final response have less direct influence. A tool called early in the execution has less impact on the final output than the synthesizer that directly produces it.
Propagation Strength
We use a propagation strength of 0.35 (35%). This means:
- 35% of responsibility flows backward through each edge
- 65% stays with the current node
Why 35%? It balances local vs. upstream responsibility. Too high (e.g., 90%) and everyone gets blamed equally. Too low (e.g., 10%) and only the final node gets blamed.
The Propagation Algorithm
Think of this like tracing a rumour back to its source. If the final answer is wrong, we ask: “Where did this information come from?” Then we trace backward through each connection, sharing blame along the way.
The synthesizer produced it. So it carries some responsibility.
But the synthesizer didn’t invent everything — it relied on other components. So we go one step back.
Then another.
And another.
At each step, we trace the rumour backward through the system, sharing responsibility along the way.
But not all sources are equally guilty.
If one component directly produced the information, it gets more blame. If it’s further away, its influence is weaker. If it plays a critical role (like a synthesizer or router), we hold it to a higher standard. And if there are strong signals — like a tool failure or clear user dissatisfaction — that shifts suspicion even more.
So instead of asking “who was involved?”, we ask something more useful:
“Who most influenced the bad outcome?”
By the end, we don’t get a single culprit — we get a distribution of responsibility across the system.
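Here is a minimal sketch of that backward pass, reusing the toy EpisodeGraph and the EDGE_WEIGHTS and DISTANCE_DECAY tables above. The role multiplier is stubbed to 1.0, and this is a simplified reading of the formula, not the production algorithm:
from collections import deque

PROPAGATION_STRENGTH = 0.35

def propagate_blame(graph, final_node_id, scores):
    # scores: node_id -> responsibility, seeded with the adjusted priors
    queue = deque([(final_node_id, 0)])
    visited = {final_node_id}
    while queue:
        node_id, hops = queue.popleft()
        for source_id, relation in graph.predecessors(node_id):
            contribution = (
                scores.get(node_id, 0.0)
                * EDGE_WEIGHTS.get(relation, 0.0)
                * 1.0  # role multiplier, stubbed for this sketch
                * DISTANCE_DECAY.get(hops + 1, 0.55)  # keep last decay past 4 hops
                * PROPAGATION_STRENGTH
            )
            scores[source_id] = scores.get(source_id, 0.0) + contribution
            if source_id not in visited:
                visited.add(source_id)
                queue.append((source_id, hops + 1))
    return scores

scores = propagate_blame(graph, "final_response", {"final_response": 0.5})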

But wait, there’s more…
We’ve talked about the Ladder of Causation, and so far, even though it may feel like we’re going the extra mile — collecting different signals, adjusting responsibility scores, even using LLMs to semantically refine those signals and create flags for potential issues — we’re still operating at the first level of the ladder.
And I promised you a taste of causality.
So When Do We Reach True Causality?
We cross into real causal reasoning when we start asking questions like:
- “What if this router had chosen a different agent?”
- “What if we removed this tool from the execution?”
That’s no longer just analyzing what happened — that’s imagining changes to the system and observing their effects. (We’re not actually changing anything — we’re reasoning about what would have happened.)
And that’s exactly what the next stage of the pipeline does.
Counterfactual Replay Engine
Think of this as the moment we stop guessing — and start testing.
Up to this point, we’ve done a pretty good job identifying suspects. The attribution step looked at the execution trace and said:
“Something went wrong… and these components are the most likely responsible.”
But suspicion isn’t proof.
This is where the Counterfactual Replay Engine comes in.
From “Who might be responsible?” to “What if we fix it?”
The core idea is simple:
If we believe a component caused the failure, what would happen if it had worked correctly?
- Would the final answer improve?
- Would the failure disappear?
- Or would nothing really change?
Instead of debating this theoretically, the system simulates it.
Setting the Stage: The Replay Context
Before we can run these “what if” scenarios, we need to carefully prepare the experiment.
At this stage, the system:
- Gathers everything we’ve learned so far — the execution graph, signals, and attribution scores
- Identifies the most suspicious nodes (our main candidates)
- Understands the type of failure we’re dealing with (hallucination, missing evidence, bad routing, etc.)
- Packages all of this into a clean, structured context for replay
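A minimal sketch of what that packaged context could look like, reusing the earlier toy structures (the field names are hypothetical):
from dataclasses import dataclass, field

@dataclass
class ReplayContext:
    graph: EpisodeGraph      # the episode graph built earlier
    signals: EpisodeSignals  # the extracted multi-modal signals
    attribution_scores: dict # node_id -> responsibility score
    suspects: list = field(default_factory=list)  # top-ranked candidates
    failure_type: str = "unknown"  # e.g. "hallucination", "bad_routing"

context = ReplayContext(graph=graph, signals=signals,
                        attribution_scores=scores,
                        suspects=["router"],
                        failure_type="bad_routing")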
At this point, we’ve moved from:
“This node looks guilty”
to
“Let’s test what happens if we fix this node.”
Running the Experiment: Graph Surgery
Now comes the interesting part.
The replay engine performs what you can think of as graph surgery.
We go back to the execution graph and intervene:
- Replace a bad output with a correct one
- Inject missing evidence
- Fix a routing decision
- Remove a harmful contribution
In causal terms, this is a do-intervention:
We are no longer observing what happened — we are actively changing it.
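In code, a do-intervention on the toy graph can be as simple as cloning the graph and forcing one node's output to a corrected value (the corrected output here is invented for illustration):
import copy

def do_intervention(graph, node_id, corrected_output):
    # Clone the graph so the original episode stays untouched, then
    # force the chosen node's output: this is the "graph surgery"
    counterfactual = copy.deepcopy(graph)
    counterfactual.nodes[node_id].output = corrected_output
    return counterfactual

cf_graph = do_intervention(graph, "router",
                           corrected_output="route to specialist_agent")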
Letting the Effects Flow
Once the intervention is applied, we let the system “run forward” again.
Not fully re-executing everything, but propagating the impact through the graph:
- If we fix a retrieval step, does the downstream reasoning improve?
- If we correct a summary, does the final answer become more accurate?
- If we remove an agent’s contribution, does the outcome get better or worse?
This is where causality becomes visible — through impact.
Measuring the Difference
Now we compare:
Original outcome vs. Counterfactual outcome
And we look at the delta:
- Did the failure severity decrease?
- Did user dissatisfaction improve?
- Did key semantic issues (like hallucination or missing evidence) disappear?
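A sketch of the comparison step. Here evaluate_outcome is a hypothetical scorer (for example, an LLM judge or a heuristic) returning failure severity in [0, 1]; a positive delta means the intervention prevented part of the failure:
def counterfactual_delta(original_graph, cf_graph, evaluate_outcome):
    # Severity is higher for worse outcomes, so a positive delta means
    # the counterfactual world is better than what actually happened
    return evaluate_outcome(original_graph) - evaluate_outcome(cf_graph)

# e.g., with some scorer:
# delta = counterfactual_delta(graph, cf_graph, evaluate_outcome=my_scorer)
# if delta > 0.3:  # illustrative threshold
#     print("Strong causal evidence: this node caused the failure")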
The bigger the improvement, the stronger the evidence that:
“Yes — this component wasn’t just involved… it actually caused the problem.”
From Suspects to Culprits
This is the key shift:
- Attribution gives us suspects
- Replay gives us causal validation
A component with high responsibility but low counterfactual impact?
→ Probably a bystander.
A component where fixing it dramatically improves the outcome?
→ That’s your culprit.
Why This Matters
In complex multi-agent systems, everything is connected.
A bad answer can be the result of many small issues interacting.
Without counterfactual reasoning, we’re left with educated guesses.
But with replay, we can ask:
“If this part had worked differently, would the story have changed?”
And that’s the essence of causality.
Where This Fits in the Ladder of Causation
This is where we finally move beyond observation.
- Attribution lives at Level 1 (Association) — spotting patterns
- Replay moves us into Level 2 (Intervention) — testing changes
- And when we ask “Would this failure have been avoided?” we’re stepping into Level 3 (Counterfactuals)
This is the moment the system stops being descriptive…
and starts becoming explanatory.
The Analogy
If attribution is like tracing a rumour back to its sources,
then counterfactual replay is asking:
“What if that person had stayed silent… would the rumour still exist?”
That’s how you find the real cause.
Conclusion
Debugging multi-agent systems can feel like chasing shadows. So many moving parts, so many interactions — and when something goes wrong, it’s rarely obvious why.
We tried to come up with something that could bring a bit of structure to the chaos.
By combining attribution, graph-based reasoning, and counterfactual replay, we move beyond simply observing failures — we start explaining them.
Not just what happened.
But why it happened.
And more importantly, what would have changed the outcome.
References:
- Pearl, J., & Mackenzie, D. (2018). The Book of Why: The New Science of Cause and Effect. Basic Books.
Meet the Authors
Rodrigo Araujo — AI Technical Advocate
Mansura Habiba — Principal Architect, Project Solis
Michał Ulewicz — AI Technical Advocate