One Agent, Many Agents, or Something In Between? A Decision Framework for Agent Architecture

Stop asking “should I go multi-agent?”. Start asking “what kind of boundary do I actually need?”

The multi-agent debate has become weirdly binary. Every week I see a new post arguing for one of two camps: “one powerful agent with great tools is all you need” or “the future is swarms of specialized agents talking over A2A.” Both sides cite impressive numbers. Anthropic reported their multi-agent research system outperformed a single Claude Opus 4 by 90.2% on internal evals. Google’s research, on the same kind of benchmarks, found that coordination between agents can degrade sequential reasoning performance by 39–70%.

Both are right. They’re just measuring different things.

The problem is that “single-agent vs multi-agent” hides the architecture that actually matters in production. Over the last year of auditing client agents and building my own, I’ve converged on a simple observation: the choice isn’t between two options, or even three. It’s about three stackable layers — and the real design question is which layers your agent needs, not which “level” to pick. Most people either stay on one layer and drown, or skip straight to the top layer and pay the full infrastructure tax for problems the middle layer would have solved.

This article is the decision framework I use to pick the right combination of layers for each agent I build.

The Three Layers

Before we go further, one critical reframing: these are not competing architectures you choose between. They’re layers that stack. A production agent often uses all three at once — skills at the core, subagents for context-heavy operations, A2A at the boundary with other systems. The question isn’t “which level?” — it’s “which layers do I need, and why?”

From inside to outside:

Layer 1 — Skills. A single agent with a fixed toolkit (in my case, the 9-tool framework). All domain knowledge is loaded on demand as skills — SKILL.md files that teach the agent how to use its existing tools for a specific domain. One process, one context window, no orchestration. This is the core layer — every agent has it, whether they call it “skills” or not.

Layer 2 — Subagents. Same process, same codebase. But when the agent hits a task that needs a fresh context window or can be parallelized, it spawns a subagent via a Task tool or equivalent delegation primitive. The subagent runs in an isolated context, returns a distilled result, and dies. The parent agent keeps a clean main context. This layer sits on top of Layer 1 — the subagent itself is an agent that has its own skills.

Layer 3 — A2A. Separate processes, separate deployments, separate trust zones. Agents discover each other through Agent Cards, communicate over HTTP via the A2A protocol, and expose typed contracts. This is the “microservices for agents” world. And critically: each A2A-reachable agent is internally a Layer 1 agent (with its own skills), and can internally use Layer 2 (spawn its own subagents).

The three layers aren’t alternatives; they’re concentric. Every agent has Layer 1. Layer 2 wraps it when context isolation is needed. Layer 3 wraps both when a structural boundary is required.

A real production architecture looks like this: the user talks to one top-level agent. That agent is a Layer 1 agent (fixed toolkit + skills). For heavy context work, it spawns Layer 2 subagents. For work that needs to cross a trust or organizational boundary, it reaches out to other agents over Layer 3. Each of those agents is itself a Layer 1 agent internally, possibly with its own Layer 2 subagents.

The layers compose. Once you see the architecture this way, the decision stops being “which level am I at?” and becomes “which layers does this specific capability need, and where?”

Layer 1 — Skills: The Core Layer

I covered this architecture in depth in 9 Tools. That’s All Your Agent Actually Needs. The short version: give your agent a fixed 9-tool toolkit (file system, internet, execution, skill management), and treat every domain-specific capability as a SKILL.md file that’s loaded on demand.
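To make the mechanics concrete, here is a minimal sketch of on-demand skill loading. The directory layout, function names, and the `<skill>` wrapper are my illustration, not a prescribed API:

```python
from pathlib import Path

def load_skill(skills_dir: str, name: str) -> str:
    """Read a domain's SKILL.md so it can be injected into the prompt."""
    return (Path(skills_dir) / name / "SKILL.md").read_text(encoding="utf-8")

def build_prompt(base_prompt: str, skill_text: str) -> str:
    # The skill is appended only while the task needs it; "unloading" is
    # simply building the next prompt without this block.
    return f"{base_prompt}\n\n<skill>\n{skill_text}\n</skill>"
```

Adding a capability really is just dropping another folder with a SKILL.md into `skills_dir`; nothing else changes.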

What it solves: context pollution. Tool selection accuracy jumps from around 60% to around 95% because the agent only ever sees 9 clearly differentiated tools plus the currently loaded skill. Startup context drops from 12,000+ tokens to under 2,000. Adding a new capability is as simple as dropping a Markdown file in a folder — no redeployment.

What it doesn’t solve: within a single long session, context still accumulates. Skill unloading helps — you can remove the SKILL.md prompt injection and compress execution traces to a short summary — but the conversation history, the tool call results, and the intermediate reasoning all stay in the window. By the time your agent has loaded, used, and unloaded three different skills in a row, the main context is muddy. You start seeing the model mix domains, forget earlier constraints, or pick the wrong skill on the next turn because the context is pulling it toward a previous topic.

This is the ceiling of Layer 1 on its own. Skill unloading is surgical cleaning, not a fresh context.

When Layer 1 alone is enough: single user, single trust zone, tasks that are mostly sequential, no hard parallelism requirement, no need for audit trails across separate systems. Most agents I build start here and stay here for a long time — and critically, even when they grow, Layer 1 never goes away. It’s always the core.

Layer 2 — Subagents: The Missing Middle

This is the layer nobody talks about, and it’s the one that resolves the biggest gap in Layer 1 without paying the A2A tax.

A subagent is a child agent spawned from the main agent’s process. It gets its own fresh context window, its own system prompt, and typically a narrower toolset. The parent delegates a bounded task — “research X”, “verify this artifact”, “extract a summary from this 20k-token file” — and receives back a distilled result. The subagent’s full context (tool calls, intermediate reasoning, raw data) never touches the parent’s window.

Crucially: a subagent is itself a Layer 1 agent. It has its own skills, its own toolkit. Layer 2 isn’t a replacement for Layer 1 — it’s Layer 1 wrapped in a delegation primitive.

Anthropic’s own research system uses this pattern: a lead agent (Opus 4) spawns 3–5 subagents (Sonnet 4) that run in parallel, each with 3+ tools concurrently. The parent only sees their final summaries. The full trace of what each subagent did stays contained.
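A stripped-down sketch of that fan-out pattern, with a plain async function standing in for the real subagent (which would be a fresh LLM context with its own tools and skills):

```python
import asyncio

async def run_subagent(direction: str) -> str:
    """Stand-in for a spawned subagent. Its raw trace (tool calls, search
    results, intermediate reasoning) would stay here, never reaching the parent."""
    await asyncio.sleep(0)  # placeholder for the subagent's own turns
    return f"distilled summary: {direction}"

async def lead_agent(directions: list[str]) -> list[str]:
    # Fan out one subagent per research direction and collect only
    # the distilled summaries.
    return await asyncio.gather(*(run_subagent(d) for d in directions))

summaries = asyncio.run(lead_agent(["pricing", "competitors", "regulation"]))
```

The parent's context grows by three short summaries, not by three full research traces.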

What adding Layer 2 gives you on top of Layer 1:

  • True context isolation. Not “cleaned”, isolated. The parent never sees the subagent’s 15,000 tokens of raw search results.
  • Parallelization. Three research directions in parallel instead of sequentially, often 60–70% faster on investigation-heavy tasks.
  • Specialization without deployment. A “verifier subagent” with a strict system prompt and read-only tools is one file, not one service.

What Layer 2 still doesn’t give you:

  • A security boundary. The subagent runs in the same process, with the same credentials, the same network access. If the parent is compromised, the subagent is too.
  • An organizational boundary. Every subagent is owned by whoever owns the parent agent. Can’t hand off across teams.
  • Independent deployment. Update a subagent’s system prompt and you redeploy the whole thing.

The mental model I use: subagents are context isolation, not system isolation. They fix the problem of main-context pollution and they unlock parallelism. That’s it. That’s already a lot.

The best use cases I’ve seen for subagents:

  1. Lookup and retrieval that returns too much raw data. Reading a 30k-token file to extract one summary. Querying a verbose API to extract three fields. The subagent digests, the parent sees only the distilled output.
  2. Verification. After the main agent produces an artifact, spawn a verifier subagent with clear success criteria and read-only tools. It doesn’t need to know how the artifact was built — just whether it meets the spec. The telephone-game failure mode doesn’t apply because verification is naturally low-context-transfer.
  3. Parallel exploration. Research, code search, multi-source data gathering. Things that can run independently and get merged at the end.
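As an illustration of use case 2, here is a sketch of the verifier boundary. The real check would be an LLM subagent with a strict prompt and read-only tools; a deterministic function stands in so the shape of the contract is visible. All names are mine:

```python
from dataclasses import dataclass, field

@dataclass
class Verdict:
    passed: bool
    failures: list[str] = field(default_factory=list)

def verify_artifact(artifact: str, criteria: list[str]) -> Verdict:
    """Check an artifact against explicit success criteria. The verifier
    never sees how the artifact was built, only whether it meets the spec."""
    failures = [c for c in criteria if c.lower() not in artifact.lower()]
    return Verdict(passed=not failures, failures=failures)
```

The parent only consumes the verdict, which is exactly why the telephone-game failure mode doesn't bite here.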

There’s a subtle caveat here, and it’s the one I’ve hit the most: the parent agent still decides when to spawn a subagent, and that decision is not deterministic. You can prompt it, you can give it clear criteria, but you’re still relying on the LLM to route correctly. If the routing matters, Layer 2 alone won’t cut it — you need to add Layer 3 on top.

Layer 3 — A2A: When Agents Become Services

I covered the full protocol stack in MCP + A2A via Skills: The Complete Protocol Stack Your Multi-Agent System Needs. The short version: A2A is the protocol that lets agents discover each other (via Agent Cards), delegate bounded tasks (with a proper lifecycle: submitted → working → completed), and collaborate without exposing their internal state.
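For a feel of what discovery looks like, here is a minimal Agent Card sketch. The field names follow the shape of A2A's JSON cards, but the email-agent details and endpoint are invented for illustration:

```python
import json

agent_card = {
    "name": "email-agent",
    "description": "Sends transactional email on behalf of other agents",
    "url": "https://agents.example.com/email",  # hypothetical endpoint
    "capabilities": {"streaming": False},
    "skills": [
        {
            "id": "send_email",
            "name": "Send email",
            "description": "Deliver a message to a recipient list",
        }
    ],
}

# Peers fetch this card to learn the contract before delegating a task.
card_json = json.dumps(agent_card, indent=2)
```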

Remember: A2A doesn’t replace the inner layers. Each agent you reach over A2A is itself a Layer 1 agent with its own skills, and likely uses Layer 2 subagents internally. A2A is the outermost boundary — the layer where agents meet other agents across trust zones.

The thing most articles get wrong about A2A is framing it purely as “agent-to-agent communication.” That’s technically true but it misses the structural reason A2A matters: A2A is where agents stop being code and start being services.


Three things change when you add Layer 3 on top of your architecture:

1. You get a real determinism boundary on routing. A2A agents expose typed capabilities through their Agent Card. When your trading agent calls the email agent over A2A, the request is a structured task with a defined input schema. The routing isn’t “the LLM decides whether to spawn a subagent” — it’s a deterministic function call to a known service with a known contract. The execution of the target agent is still probabilistic (it’s an LLM inside, running its own Layer 1 skills), but the decision to invoke it and the shape of what comes back are pinned down. That’s a huge shift in reliability.

2. You get a real trust boundary. Separate process, separate credentials, separate network surface. This is where it stops being a purely technical decision and becomes an organizational one. Some agents touch sensitive data — customer PII, payment flows, internal financial records. Putting them behind A2A with their own sandbox isn’t overhead, it’s exactly the isolation your security team is going to ask for. A compromise of your public-facing agent should not cascade to your compliance agent.

3. You get a real organizational boundary. This is the angle most technical articles skip. In a real company, different teams own different agents. The marketing team’s content agent, the ops team’s scheduling agent, the finance team’s reconciliation agent — these should not be a single codebase. A2A gives you the protocol that lets them interoperate without requiring a monorepo, a shared deployment pipeline, or a single tech stack. One team can use LangGraph, another ADK, another a raw Anthropic SDK loop. They meet at the Agent Card.
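The determinism shift in point 1 can be sketched with a typed request. The schema, capability name, and validation rule are hypothetical; the point is that routing and input validation happen in plain code before any LLM is involved:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class SendEmailTask:
    to: str
    subject: str
    body: str

def build_a2a_request(task: SendEmailTask) -> str:
    """Routing is a deterministic function call to a known capability;
    only the target agent's internal execution is probabilistic."""
    if "@" not in task.to:
        raise ValueError("recipient must be an email address")
    return json.dumps({"capability": "send_email", "input": asdict(task)})
```

A bad input fails loudly at the boundary instead of becoming a confused prompt inside the target agent.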

What adding Layer 3 costs you:

  • Latency per handoff. Network call, serialization, a fresh context load on the target agent’s side. Typically an extra 300–800ms per hop, often more.
  • Token sprawl. Each agent restates context at each handoff. Research from Atlan suggests independent agent networks can amplify token costs significantly compared to shared-context alternatives.
  • Ops complexity. You’re running distributed systems now. Tracing, observability, auth, retry logic, all of it. This is non-trivial.

When Layer 3 earns its place: when the boundary you need is structural — security, organizational, contractual, regulatory — and no amount of subagent design will give it to you. Not before.

The Decision Framework

Here’s the comparison table I use when deciding which layers to add:

The ten criteria that tell you which layer pays off where. Note the highlighted row: routing determinism is the weak link of Layer 1 and the real reason to reach for Layer 3.

Remember: you don’t pick one column. You pick which layers to add on top of the previous one. Every agent has Layer 1. Most production agents eventually add Layer 2 selectively for heavy-context operations. Only some agents need Layer 3 — and when they do, Layer 3 is added at specific boundaries, not globally.

Note what’s subtle about the determinism row: Layer 1 skills have very high execution determinism — a SKILL.md plus a Python script runs exactly the same way every time. What’s not deterministic at Layer 1 is the routing: whether the agent picks the right skill for the task. That’s an LLM decision, and it’s the weak link. Adding Layer 2 reduces routing uncertainty (you’re now deciding “spawn a subagent yes/no” rather than “which of my 14 skills”), but it’s still an LLM call. Only Layer 3 gives you a fully deterministic routing contract — because at that point the call is a typed service invocation, not a prompt.

And if you really need full determinism end-to-end, you can wrap Layer 3 in a workflow-mode orchestrator, where the sequence of agent calls is hardcoded, not LLM-decided. That’s the world where LLM reasoning is contained inside individual agents (each still using Layers 1 and 2 internally), and the glue between them is deterministic Python. For regulated domains, that’s often the only shape that ships.
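A workflow-mode orchestrator reduces to something this plain: the sequence is hardcoded Python, and each step is a call into an agent. The three agents are stubbed with simple functions here, and the names are illustrative:

```python
from typing import Callable

def run_workflow(steps: list[Callable[[str], str]], payload: str) -> str:
    """Deterministic glue: the order of agent calls is fixed in code,
    never decided by an LLM."""
    for step in steps:
        payload = step(payload)
    return payload

# Stand-ins for three A2A-reachable agents in a regulated pipeline.
def extract(doc: str) -> str:
    return f"extracted({doc})"

def reconcile(data: str) -> str:
    return f"reconciled({data})"

def report(data: str) -> str:
    return f"report({data})"

result = run_workflow([extract, reconcile, report], "ledger.csv")
```

Every run visits the same steps in the same order, which is what an auditor wants to see.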

The Four Criteria That Trigger Adding a Layer

1. Context saturation. If a typical session fills more than ~60% of the main agent’s context window before it completes — even with aggressive skill unloading — Layer 1 alone is bottlenecking you. Add Layer 2 for the heavy-context operations (big file reads, multi-source research, verification). You don’t refactor the whole agent — you add subagents where the context pressure is highest.

2. Routing determinism required. If a specific capability has to produce a typed, auditable output and the cost of a wrong routing decision is high (compliance, finance, anything with a real-world side effect), that specific capability goes behind Layer 3. The typed contract of A2A is what makes it auditable in a way a skill selection can’t be. Other capabilities of the same agent can stay in Layers 1 and 2.

3. Trust boundary required. If a capability touches data or systems that should not share a runtime with the rest of your agent — customer data, payment rails, regulated records — that capability lives behind Layer 3. Not because you can’t write a careful skill, but because the boundary has to be enforced by the infrastructure, not by the prompt.

4. Organizational boundary required. If two teams need to own two agents independently, with separate deploy pipelines, separate SLAs, and separate evaluation loops — Layer 3 between them is mandatory. Any attempt to keep them in a single agent will eventually become a cross-team coordination tax that makes everyone unhappy.
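The context-saturation trigger from criterion 1 is easy to operationalize. A sketch, using the ~60% threshold from above (the token counts in the usage comments are invented):

```python
def context_saturation(used_tokens: int, window_tokens: int) -> float:
    """Fraction of the main context window already consumed."""
    return used_tokens / window_tokens

def should_delegate(used_tokens: int, window_tokens: int,
                    threshold: float = 0.6) -> bool:
    # Past the threshold, route heavy operations (big reads, multi-source
    # research, verification) to a subagent instead of the main window.
    return context_saturation(used_tokens, window_tokens) >= threshold
```

A check like this belongs in the harness around the agent, not in the prompt: the decision to delegate is cheap to make deterministically.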

Notice the pattern: every criterion is a trigger to add a layer, not to switch to one. Your agent can have a Layer 1 core, use Layer 2 for heavy lookups, AND expose certain capabilities over Layer 3 — all at once. That’s not complexity, that’s the whole point.

Start at Layer 1. Add Layer 2 when context saturates. Add Layer 3 only when one of four structural criteria applies. Otherwise, stop — you’re done.

The Practical Path (And Where Most People Screw Up)

Here are the three anti-patterns I see most often on client audits:

Anti-pattern 1: Jumping straight to Layer 3 because “A2A is modern.” The team builds a protocol stack with Agent Cards, service discovery, and a coordinator agent — for a product that has one user and no trust boundaries. They spend three months on infra for a problem a subagent in the same process would have solved in a week. The Google “coordination drop of 39–70%” result is exactly this: Layer 3 overhead on sequential tasks that didn’t need anything beyond Layers 1 and 2.

Anti-pattern 2: Staying on Layer 1 alone and piling up 400 skills. Every new capability becomes another SKILL.md. The catalog gets huge, skill-selection accuracy drops, and context pollution returns through the back door (not because tools are loaded, but because the agent has loaded and unloaded so many skills in one session that the history is noisy). The fix almost always isn’t “organize the skills better” — it’s “add Layer 2 for these specific heavy operations.”

Anti-pattern 3: Using Layer 3 for sequential reasoning that should have stayed internal. “Planner agent calls Researcher agent calls Executor agent calls Reviewer agent” over A2A when all four steps needed to share context anyway. This is the telephone-game failure mode — Anthropic’s own research team documented it. Each handoff loses fidelity. If the steps aren’t truly independent and don’t cross a structural boundary, collapse them back into one agent using Layer 1 + Layer 2 internally.

The Trajectory That Works

  • Start at Layer 1. Always. Build one agent with the skill framework. Validate the use case. Learn what your agent actually does.
  • Add Layer 2 where context isolation pays. Verification, lookup, parallel research, heavy file processing. Subagents, not new top-level agents.
  • Add Layer 3 only where a structural boundary is required. Trust, compliance, org ownership, independent deployment. These are real reasons. “It feels more modern” is not. And when you add Layer 3, you add it at a specific boundary — not as a wholesale architecture.

The Single Point of Entry

One last thing, because this confuses people: since the layers stack, a real architecture almost always uses all three at once. The user always talks to one agent. That top-level agent is a Layer 1 agent (fixed toolkit + skills). Internally, it spawns Layer 2 subagents for heavy operations. And for work that needs a separate trust zone, it reaches out to Layer 3 peers — each of which is itself a Layer 1 + Layer 2 agent under the hood.

The three layers compose. The user never sees the plumbing. That’s the whole point.

If the user has to know whether they’re “talking to the orchestrator” or “talking to the scheduling agent,” you’ve already failed the UX.

A realistic agent architecture uses all three layers at once. The user talks to one top-level agent. Inside, Layer 1 holds the skills, Layer 2 handles heavy context work, and Layer 3 crosses into a separate email agent — which itself runs Layer 1 and Layer 2 internally.

The Bottom Line

The question isn’t “should I go multi-agent?”. It’s “which layers does this specific capability need, and where?”

  • Need cleaner context? → Add Layer 2 where the pressure is.
  • Need deterministic routing, trust isolation, or a different team to own it? → Add Layer 3 at that specific boundary.
  • None of the above? → Stay on Layer 1 and stop over-engineering.

Most agents I audit are either Layer-1-only agents drowning in their own context (need to add Layer 2), or wholesale Layer 3 architectures solving problems that didn’t need a network boundary (need to collapse some peers back into subagents). Getting the layers right is usually the single biggest lever on agent quality, cost, and maintainability.

Three layers. Four criteria. Compose them deliberately.

That’s the framework.

Thanks for reading! I’m Elliott, a Python & Agentic AI consultant and entrepreneur. I write weekly about the agents I build, the architecture decisions behind them, and the patterns that actually work in production.

If this framework helped clarify where your own agents sit on the spectrum, I’d appreciate a few claps 👏 and a follow. And if you’ve landed on a different decomposition that works in production — I’d love to hear about it in the comments.


One Agent, Many Agents, or Something In Between? A Decision Framework for Agent Architecture was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
