Simple Rules To Follow When Building Multi-Agent Systems.

When designing multi-agent systems, most failures come from the architecture, not the model. Agents start hallucinating tool calls, pipelines produce unstructured garbage, and bugs become nearly impossible to trace. If you’re building agentic systems, or planning to, these four rules will save you a lot of pain.

One agent, one responsibility

The whole point of a multi-agent system is division of labour: each agent is tasked with doing one thing and doing it well. In practice, this principle is often violated unintentionally, usually through overly powerful tools whose functionality reaches beyond the agent's scope. That blurs responsibilities and makes behaviour harder to reason about.

To better illustrate, imagine an e-commerce pipeline where an inventory_agent checks stock levels, and a pricing_agent calculates the final price for a customer's cart. If you give inventory_agent access to a calculate_discount tool because "it's convenient," you've now created an agent that both checks inventory and applies pricing logic. When a discount is applied incorrectly, you can't tell which agent is responsible. The architecture has broken down.

Here’s what a correctly structured version of that flow looks like:

from google.adk.agents import LlmAgent, SequentialAgent

# Checks whether requested items are available
inventory_agent = LlmAgent(
    name="InventoryAgent",
    instruction="""
    Check whether the items in the customer's cart are in stock.
    For each item, return its availability status and available quantity.
    Do not calculate prices or apply discounts.
    """,
    tools=[check_stock_levels],  # only touches inventory
    output_key="inventory_status",
)

# Calculates final price, including applicable discounts
pricing_agent = LlmAgent(
    name="PricingAgent",
    instruction="""
    Given the inventory status in {inventory_status}, calculate the
    final price for each available item. Apply any applicable
    discounts. Do not check or modify stock levels.
    """,
    tools=[get_base_price, apply_discount_rules],  # only touches pricing
    output_key="final_pricing",
)

checkout_pipeline = SequentialAgent(
    name="CheckoutPipeline",
    sub_agents=[inventory_agent, pricing_agent],
)

With this pattern, each agent has a single concern, and each tool list enforces that boundary. When something breaks, you know exactly where to look.

Always define schemas for tool inputs and outputs

This step often gets skipped when building fast, and you always pay for it later with mysterious agent failures. The more inference your multi-agent system runs, the more you need a contract, in the form of type definitions or schemas, between agents and their tools: a guarantee that each tool expects a specific input shape and always returns a specific output shape.

In Python, this is straightforward with Pydantic. Consider a search_order tool for a customer support agent, where a missing or malformed order ID would otherwise cause downstream agents to fail silently.

Without schemas, the tool accepts anything and returns anything, which is a recipe for unpredictable behaviour at runtime.

# Without schemas == fragile and unpredictable
def search_order(order_id, customer_email=None):
    result = db.query(order_id)
    return result  # could be None, a dict, a list, who knows?

Using Pydantic, inputs can be validated before the tool runs, outputs are always structured, and failure modes become explicit:

from typing import Optional, Literal
from pydantic import BaseModel, Field, field_validator
import re


class SearchOrderInput(BaseModel):
    """Input schema for the search_order tool."""

    order_id: str = Field(
        ...,
        description="The order ID to look up. Must follow format ORD-XXXXXXXX.",
        min_length=3,
        max_length=20,
    )
    customer_email: Optional[str] = Field(
        None,
        description="Optional customer email for cross-verification.",
    )

    @field_validator("order_id")
    @classmethod
    def validate_order_id_format(cls, v: str) -> str:
        if not re.match(r"^ORD-[A-Z0-9]{8}$", v.upper()):
            raise ValueError(
                f"Invalid order ID format: '{v}'. Expected format: ORD-XXXXXXXX"
            )
        return v.upper()


class OrderResult(BaseModel):
    """Structured output for a single matched order."""

    order_id: str
    status: Literal["pending", "processing", "shipped", "delivered", "cancelled"]
    items: list[str]
    total_amount: float
    estimated_delivery: Optional[str]


class SearchOrderOutput(BaseModel):
    """Output schema for the search_order tool."""

    found: bool = Field(..., description="Whether a matching order was found.")
    order: Optional[OrderResult] = Field(
        None, description="The matched order, if found."
    )
    error_message: Optional[str] = Field(
        None, description="Human-readable error message if the lookup failed."
    )


def search_order(order_id: str, customer_email: Optional[str] = None) -> dict:
    """Looks up an order by ID, with optional email verification.

    Args:
        order_id: The order ID to look up.
        customer_email: Optional customer email for cross-verification.

    Returns:
        dict: Structured search result matching SearchOrderOutput schema.
    """
    try:
        validated = SearchOrderInput(order_id=order_id, customer_email=customer_email)
    except Exception as e:
        return SearchOrderOutput(
            found=False,
            error_message=f"Invalid input: {str(e)}",
        ).model_dump()

    # Guard the external lookup so a database error (e.g. a timeout)
    # surfaces as a structured failure instead of an unhandled exception.
    try:
        record = order_db.find(validated.order_id, email=validated.customer_email)
    except Exception as e:
        return SearchOrderOutput(
            found=False,
            error_message=f"Order lookup failed: {str(e)}",
        ).model_dump()

    if not record:
        return SearchOrderOutput(
            found=False,
            error_message=f"No order found with ID {validated.order_id}.",
        ).model_dump()

    return SearchOrderOutput(
        found=True,
        order=OrderResult(
            order_id=record["id"],
            status=record["status"],
            items=record["items"],
            total_amount=record["total"],
            estimated_delivery=record.get("eta"),
        ),
    ).model_dump()

Now, the agent calling this tool always gets back a predictable structure. Downstream agents know exactly what fields to expect. If a bad order ID somehow comes in, the error is dealt with at the boundary rather than cascading silently through the pipeline.

Log every tool call and agent routing decision

When building a multi-agent system with any of the popular frameworks, such as LangGraph or ADK, much of the orchestration is abstracted away, so logging is the only way to know what action is actually being carried out. Without it, you can’t see which agent handled a request, which tool was called, what arguments were passed, or where in the pipeline a failure occurred.

During development, you don’t need a complex observability setup. Well-placed print statements or Python’s logging module are enough to give you visibility into what's happening. The key is to log at the right points, e.g., at tool entry, after input validation, after external calls, and at tool exit.

To illustrate, here is a get_order_context tool instrumented with meaningful logging:

import logging
from google.adk.tools.tool_context import ToolContext
from app.schemas import GetOrderContextInput, OrderContextOutput

logger = logging.getLogger(__name__)


def get_order_context(customer_id: str, tool_context: ToolContext) -> str:
    """Fetches recent order history to provide context for a support query.

    Args:
        customer_id: The customer's unique identifier.
        tool_context: ADK tool context providing access to session state.

    Returns:
        str: Formatted order history for use by the support agent.
    """
    logger.info("Tool 'get_order_context' called | customer_id=%s", customer_id)

    # Validate input
    try:
        validated = GetOrderContextInput(customer_id=customer_id)
    except Exception as e:
        logger.error("Input validation failed | customer_id=%s | error=%s", customer_id, e)
        return f"Error: Invalid input — {str(e)}"

    # Retrieve required session state
    api_key = tool_context.state.get("orders_api_key")
    if not api_key:
        logger.error(
            "Missing 'orders_api_key' in session state | customer_id=%s", customer_id
        )
        return "Error: API key configuration missing. Cannot fetch order history."

    locale = tool_context.state.get("locale", "en-US")
    logger.debug("Session state | locale=%s | customer_id=%s", locale, customer_id)

    # Fetch order history
    try:
        orders = order_service.get_recent_orders(
            customer_id=validated.customer_id,
            api_key=api_key,
            limit=5,
        )
        logger.info(
            "Order fetch successful | customer_id=%s | order_count=%d",
            customer_id,
            len(orders),
        )
    except order_service.TimeoutError:
        logger.warning("Order service timed out | customer_id=%s", customer_id)
        return "Error: Order service is temporarily unavailable. Please try again."
    except Exception as e:
        logger.exception("Unexpected error fetching orders | customer_id=%s", customer_id)
        return f"Error: Failed to retrieve orders: {str(e)}"

    output = OrderContextOutput(orders=orders, locale=locale)
    logger.info("Tool 'get_order_context' completed | customer_id=%s", customer_id)
    return output.formatted_summary()

Notice the logging covers four distinct moments: tool entry, validation failure, the external service call result, and tool exit. Each log line includes enough context (customer ID, counts, locale) to reconstruct exactly what happened, without having to reproduce the failure.

For serious production systems, you will want to go further: structured JSON logging, correlation IDs that trace a request through every agent in the pipeline, and tools like OpenTelemetry or the Grafana stack for aggregation and alerting. Even so, the approach above is transformative compared to flying blind.
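To make that jump concrete, here is a sketch of structured JSON logging with a correlation ID using only the standard library; the `JsonFormatter` class, field names, and logger name are illustrative choices, not part of any framework:

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line (illustrative sketch)."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Attached via the `extra` argument below; lets you grep a single
            # request's path through every agent in the pipeline.
            "correlation_id": getattr(record, "correlation_id", None),
        })


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("pipeline")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Generate one ID per incoming request and pass it to every log call.
correlation_id = str(uuid.uuid4())
logger.info("routing to PricingAgent", extra={"correlation_id": correlation_id})
```

Because every line is valid JSON, a log aggregator can index the fields directly, and filtering by `correlation_id` reconstructs one request end to end.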

Write tests for tools, for routing, for schemas

When your pipeline is growing quickly and something suddenly breaks, the difference between “I know exactly what went wrong” and “I’m guessing for two hours” is test coverage. Without it, you can’t tell whether a wrong response came from a failed tool call, a misconfigured agent, a schema mismatch, or a routing error; you end up playing an expensive guessing game.

Remember the search_order tool from earlier? Here's what a proper test suite for it looks like using pytest:

import pytest
from unittest.mock import patch
from app.tools import search_order


class TestSearchOrderTool:
    """Tests for the search_order customer support tool."""

    @pytest.fixture
    def mock_order_db(self):
        with patch("app.tools.order_db") as mock:
            yield mock

    def test_returns_order_when_found(self, mock_order_db):
        mock_order_db.find.return_value = {
            "id": "ORD-AB12CD34",
            "status": "shipped",
            "items": ["Blue Sneakers (Size 10)", "White Laces"],
            "total": 89.95,
            "eta": "2024-12-20",
        }

        result = search_order(order_id="ORD-AB12CD34")

        assert result["found"] is True
        assert result["order"]["order_id"] == "ORD-AB12CD34"
        assert result["order"]["status"] == "shipped"
        assert len(result["order"]["items"]) == 2

    def test_returns_not_found_for_missing_order(self, mock_order_db):
        mock_order_db.find.return_value = None

        result = search_order(order_id="ORD-ZZ99ZZ99")

        assert result["found"] is False
        assert result["order"] is None
        assert "ORD-ZZ99ZZ99" in result["error_message"]

    def test_rejects_malformed_order_id(self):
        result = search_order(order_id="invalid-id-format")

        assert result["found"] is False
        assert "Invalid input" in result["error_message"]

    def test_rejects_empty_order_id(self):
        result = search_order(order_id="")

        assert result["found"] is False
        assert result["order"] is None

    def test_output_always_has_required_keys(self, mock_order_db):
        mock_order_db.find.return_value = None

        result = search_order(order_id="ORD-AB12CD34")

        assert "found" in result
        assert "order" in result
        assert "error_message" in result

    def test_handles_db_timeout_gracefully(self, mock_order_db):
        mock_order_db.find.side_effect = TimeoutError("DB connection timed out")

        result = search_order(order_id="ORD-AB12CD34")

        assert result["found"] is False
        assert result["error_message"] is not None

A few things are worth noting here. The tests cover not just the happy path but also the failure modes: missing orders, bad input, and unexpected exceptions. And the mock_order_db fixture patches the database out entirely, so the suite runs fast and deterministically.
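The suite above exercises the tool, but the schemas deserve tests of their own, so a broken contract is caught before any tool logic runs at all. A sketch of what that might look like (the schema here is a condensed, self-contained copy of the SearchOrderInput defined earlier, trimmed to its validation rules):

```python
import re

import pytest
from pydantic import BaseModel, Field, ValidationError, field_validator


# Condensed copy of the SearchOrderInput schema from earlier,
# repeated here so these tests are self-contained.
class SearchOrderInput(BaseModel):
    order_id: str = Field(..., min_length=3, max_length=20)

    @field_validator("order_id")
    @classmethod
    def validate_order_id_format(cls, v: str) -> str:
        if not re.match(r"^ORD-[A-Z0-9]{8}$", v.upper()):
            raise ValueError(f"Invalid order ID format: '{v}'")
        return v.upper()


class TestSearchOrderSchema:
    """The contract holds regardless of what the tool does internally."""

    def test_valid_id_is_normalised_to_uppercase(self):
        assert SearchOrderInput(order_id="ord-ab12cd34").order_id == "ORD-AB12CD34"

    def test_malformed_id_is_rejected(self):
        with pytest.raises(ValidationError):
            SearchOrderInput(order_id="not-a-valid-id")

    def test_too_short_id_is_rejected(self):
        with pytest.raises(ValidationError):
            SearchOrderInput(order_id="OR")
```

Schema tests like these double as executable documentation: anyone changing the order ID format will break a test before they break an agent.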

Writing tests is not an agentic development practice specifically; it’s a foundational engineering practice that applies to any software you intend to put in people’s hands. In agentic systems, the cost of skipping tests is just higher because failures are harder to reproduce due to the non-deterministic nature of LLMs.

None of these rules is glamorous. They won’t make your demo more impressive or your architecture diagram on excalidraw more complex. But, walahi, most of the time, they are the difference between a multi-agent system that works reliably and one that quietly degrades until something goes wrong in front of a customer.

In summary, build narrow agents, define your contracts, log what matters, and test before it breaks.

Connect with me: https://www.linkedin.com/in/stephen-nwankwo-9876b4196/


Simple Rules To Follow When Building Multi-Agent Systems. was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
