Your AI Agent Is Goldfish-Brained. Qwen3.6–35B-A3B Is the Fix.

The open-weight upgrade that changed how agents reason — and why one new parameter matters more than all the benchmark numbers combined.

Every LLM agent you’ve ever deployed has a problem it doesn’t advertise. Ask it to refactor a module, get three turns in, then ask it to revisit its earlier reasoning — and watch it re-derive conclusions it reached two turns ago from scratch. The model doesn’t carry its thinking forward. Each response ends, the <think> block evaporates, and the next turn begins from a blank slate. The intelligence is real. The memory of how it got there is gone.

This is the goldfish-brain problem. The model is capable, but stateless in the one place where statefulness would matter most: its own reasoning chain.

Qwen3.6–35B-A3B, released by Alibaba in April 2026, is the first open-weight model from the Qwen3.6 generation. It runs the same 35B-total / 3B-active hybrid architecture covered in my earlier piece on Qwen3.5–35B-A3B — but the upgrade from 3.5 to 3.6 wasn’t about scaling the model or redesigning the stack. It was about fixing the goldfish.

The Foundation You Already Know (Don’t Skip If You’re New)

Qwen3.6–35B-A3B shares its skeleton with Qwen3.5–35B-A3B. If you want the full architectural breakdown — how Gated DeltaNet handles long sequences without quadratic scaling, why the 3:1 hybrid between linear and full softmax attention is deliberate rather than a compromise, and how 256 experts route each token through exactly 9 active sub-networks — that article covers it in depth.

The short version: 35 billion parameters are stored, 3 billion are activated per token, and a sliding mix of linear attention (cheap, stateful) and full softmax attention (expensive, precise) handles the load across 40 layers. The result is a model that runs on a single 24 GB GPU with appropriate quantization and outperforms models with 7× more active parameters on practical coding tasks.

What changed in 3.6 is the behavior, not the bones.

The Three Things Qwen3.6 Actually Shipped

Alibaba described the release as “built on direct feedback from the community” (HuggingFace model card) — an unusual framing for a model announcement. The delta from 3.5 is targeted rather than sweeping:

  1. Thinking preservation across conversation turns
  2. Multi-Token Prediction for faster inference
  3. Sharpened agentic coding and instruction-following

Each sounds incremental on a changelog. The first one, in practice, changes how you structure agents.

Thinking Preservation: The Detective Who Burns Their Notes

Standard LLM inference generates a hidden <think> block before producing a visible response. That reasoning trace — the model working through intermediate steps, checking constraints, ranking approaches — is discarded the moment the response is sent. On the next turn, the model has only the conversation history: questions, answers, and tool outputs. The how-I-got-here is gone.

For a single-turn query, this doesn’t matter. For a coding agent running across 20 turns of incremental debugging, it compounds. The model re-explores territory it already mapped, re-considers constraints it already resolved, and sometimes contradicts its own prior conclusions without realizing it — because it can’t see that it reached those conclusions.

Think of a detective who burns their case notes at the end of every day and starts fresh with only interview transcripts. The transcripts tell them what was found and what was said, but not why each lead was ruled out. They’ll re-investigate closed threads.

Qwen3.6–35B-A3B introduces preserve_thinking, a parameter that retains reasoning traces from all prior turns in the active context. When enabled, the model can see not just what it said but how it reasoned to get there. For iterative development — incremental refactoring, multi-step debugging, long-horizon planning — this eliminates a class of loop-and-repeat failures that previously required prompt engineering workarounds.

The parameter is off by default. In default mode (interleaved thinking), only the current turn’s trace is retained — lower context overhead, sufficient for most tasks. Enabling preservation:

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Multi-turn agent - the model sees its own prior reasoning on each new turn
conversation = [
    {"role": "user", "content": "Add rate limiting to this FastAPI endpoint:\n\n@app.get('/data')\ndef get_data():\n    return db.query_all()"}
]

turn1 = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=conversation,
    max_tokens=32768,
    extra_body={
        # Reasoning trace from this turn is kept in context for turn 2+
        "chat_template_kwargs": {"preserve_thinking": True},
    },
)
conversation.append({"role": "assistant", "content": turn1.choices[0].message.content})

# Turn 2 - model sees its prior reasoning, not just its answer
conversation.append({
    "role": "user",
    "content": "Now return a custom JSON error body when the limit is exceeded.",
})
turn2 = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=conversation,
    max_tokens=32768,
    extra_body={"chat_template_kwargs": {"preserve_thinking": True}},
)

One constraint: reasoning traces are verbose. A 15-turn agent session with preserve_thinking enabled can consume 50–80K tokens of context budget in reasoning alone, before factoring in code, tool outputs, or document context. For sessions beyond ~10 turns, budget your context window deliberately — or build selective truncation of older traces into your agent loop.
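One way to bound that growth is to strip reasoning traces from all but the most recent turns before each request. A minimal sketch, assuming traces appear as <think>...</think> blocks inside stored assistant messages — the helper name and the keep-last policy are illustrative, not part of any API:

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def trim_old_traces(conversation, keep_last=2):
    """Strip <think> blocks from all assistant turns except the last `keep_last`."""
    assistant_idxs = [i for i, m in enumerate(conversation) if m["role"] == "assistant"]
    keep = set(assistant_idxs[-keep_last:]) if keep_last else set()
    trimmed = []
    for i, msg in enumerate(conversation):
        if msg["role"] == "assistant" and i not in keep:
            # Older turns shrink to their visible answers; recent reasoning survives
            msg = {**msg, "content": THINK_RE.sub("", msg["content"])}
        trimmed.append(msg)
    return trimmed

history = [
    {"role": "user", "content": "q1"},
    {"role": "assistant", "content": "<think>long trace 1</think>answer 1"},
    {"role": "user", "content": "q2"},
    {"role": "assistant", "content": "<think>long trace 2</think>answer 2"},
]
slim = trim_old_traces(history, keep_last=1)  # only the newest trace remains
```

Run this in your agent loop right before each API call; the model still sees every answer, just not every derivation.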

Multi-Token Prediction: Your Model Just Got Faster Without Getting Smaller

Standard autoregressive decoding is strictly serial. The model does a full forward pass, predicts one token, appends it to the context, and does another full forward pass for the next one. For a 35B-parameter model, that’s expensive — and worse, the GPU sits mostly idle during each pass. Single-token generation doesn’t fill the parallel compute units the hardware is designed for. You’re paying for a Ferrari and driving it in first gear.

Multi-Token Prediction changes the shape of the work. During training, Qwen3.6 learned to predict not just the next token, but several future tokens at once, using extra output heads that share the same internal representation. At inference time, this enables speculative decoding: the model drafts a few tokens ahead in one cheap step, then verifies them all in a single forward pass of the main model. If the drafts match what the model would have chosen anyway, they get accepted in a batch and appended together. If some drafts are wrong, they get rejected and corrected. On a good acceptance run, you get 2–3 tokens for the cost of 1 forward pass.

The analogy: a writer who drafts a whole phrase and only retracts it if the editor objects. Most short continuations in structured output — code, JSON, markdown — are predictable enough that the editor rarely objects. After def almost certainly comes a function name. After "key": comes a value. The model is betting on patterns it already knows, and the bet usually pays off.

What makes Qwen3.6’s MTP more reliable than older speculative decoding is that the draft model is the target model. Earlier speculative setups used a separate small draft model to generate guesses, which created a distribution mismatch — the small model’s predictions often didn’t match what the big model would have picked, so drafts got rejected and compute got wasted. MTP self-drafts from the same weights that will verify, so the distributions align and acceptance rates are much higher.
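The draft-and-verify loop itself is simple enough to sketch. The toy below uses greedy lookup tables as stand-ins for the draft and target distributions — none of this is the real MTP implementation, just the accept-longest-matching-prefix logic:

```python
def speculative_step(draft_next, target_next, context, k=2):
    """One draft-and-verify step (greedy, toy). Returns tokens emitted this step.

    draft_next / target_next: functions mapping a tuple of tokens to the next token.
    The longest prefix of drafts the target agrees with is accepted in a batch,
    then the target emits one token itself - so every step yields at least one.
    """
    drafts, ctx = [], tuple(context)
    for _ in range(k):                      # cheap draft pass
        t = draft_next(ctx)
        drafts.append(t)
        ctx = ctx + (t,)
    accepted, ctx = [], tuple(context)
    for t in drafts:                        # single verify pass (conceptually)
        if target_next(ctx) == t:
            accepted.append(t)
            ctx = ctx + (t,)
        else:
            break                           # a rejection discards all later drafts
    accepted.append(target_next(ctx))       # verifier always contributes one token
    return accepted

# Toy "distributions": MTP self-drafts, so draft and target mostly agree
target = {(): "def", ("def",): "get", ("def", "get"): "_data"}.get
draft = {(): "def", ("def",): "get"}.get

out = speculative_step(draft, target, [], k=2)
# Both drafts accepted plus one verifier token: 3 tokens for one verify pass
```

In the real system both roles are played by the same weights, which is exactly why acceptance rates stay high.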

Throughput gains are workload-dependent. Reasoning-heavy outputs — long <think> blocks followed by structured answers — benefit most, because structured text has low entropy and drafts get accepted often. Open-ended creative generation benefits less, because when the next token could genuinely be anything, drafts get rejected more often and you fall back to normal single-token decoding.

vLLM with MTP speculative decoding:

vllm serve Qwen/Qwen3.6-35B-A3B \
  --port 8000 \
  --tensor-parallel-size 4 \
  --max-model-len 262144 \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

The num_speculative_tokens: 2 flag controls how many tokens the model drafts ahead per step. Going higher (3 or 4) increases the potential speedup when drafts are accepted, but the probability that all drafts in a longer sequence match the target distribution drops geometrically — push too far and you waste compute on rejected guesses. Two is the balanced default for most workloads; three can be worth testing if your outputs are heavily structured.
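The geometric falloff is easy to quantify under a simplifying assumption: if each drafted token is accepted with probability p, and a rejection discards all later drafts, the expected number of accepted drafts per step is p + p² + … + pᵏ, plus one token the verifier always emits. A quick back-of-envelope (the 0.8 acceptance rate is an illustrative figure, not a measured one):

```python
def expected_tokens_per_step(p, k):
    """Expected tokens per verify pass: accepted drafts + 1 from the verifier.

    Assumes draft i is accepted only if all earlier drafts were accepted,
    each with independent probability p - the geometric falloff.
    """
    return sum(p ** i for i in range(1, k + 1)) + 1

for k in (2, 3, 4):
    print(k, round(expected_tokens_per_step(0.8, k), 2))
# 2 2.44 / 3 2.95 / 4 3.36: each extra draft token buys less than the one before
```

At an 0.8 acceptance rate, going from 2 to 4 draft tokens doubles the drafting work for under one extra token per step — which is why 2 is the sensible default.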

SGLang (recommended when MTP throughput is the priority — it uses a tree-based speculation scheme that drafts multiple candidate branches in parallel, which pushes acceptance rates higher at the cost of slightly more verification work):

python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-35B-A3B \
  --port 8000 \
  --tp-size 4 \
  --context-length 262144 \
  --reasoning-parser qwen3 \
  --speculative-algo NEXTN \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4

The Benchmark Delta: What Moving from 3.5 to 3.6 Actually Bought You

All scores below are from Alibaba’s own HuggingFace model card — self-reported, single source. No independent third-party reproduction existed at publication date. The Gemma 4 (31B) comparison column is also from Alibaba’s reproduction on their harness, not Google’s own reports — Google’s public Gemma 4 benchmarks focus on AIME, LiveCodeBench, and GPQA rather than SWE-bench. These numbers are directional, not authoritative.

Coding and Agentic Tasks

Terminal-Bench 2.0 (+11 points) and NL2Repo (+8.9) are the sharpest jumps. Both measure practical agent behavior in terminal environments and multi-file repository tasks — closer to what production coding agents actually encounter than the headline SWE-bench Verified number. The SWE-bench Verified delta of +3.4 is meaningful but modest; the real story is in the applied benchmarks.

For external context: SWE-bench Verified at 73.4% sits roughly 14 points below Claude Opus 4.7 (87.6%, self-reported by Anthropic) and about 7 points below the previous-generation Opus 4.6 (80.8%). The gap to the frontier is real, but Qwen3.6 runs on hardware you own under Apache 2.0 — a different value proposition than cloud-hosted closed models.


Knowledge and Reasoning

MMLU-Pro barely moved. This is not a model that improved on generic knowledge retrieval. Gains are concentrated in reasoning under competition-math and graduate-science conditions. The MMLU-Pro flatness is not a regression — it reflects the different optimization target.

Vision and Multimodal (Qwen-reported; see note below)

Vision gains are incremental. Document understanding (OmniDocBench 89.9%) and video reasoning (VideoMMU 83.7%) are the practical targets for enterprise document workflows and long-form video analysis.

One caveat on the Claude Sonnet 4.5 column: these are Alibaba’s reproductions on their own evaluation harness, not Anthropic’s self-reported numbers. Anthropic’s own MMMU for Sonnet 4.5, for example, is 77.8% rather than the 79.6% shown here. Cross-vendor multimodal comparisons rarely use matching evaluation conditions, and small differences in prompt template, image preprocessing, or scoring pipelines can move scores by several points. Treat the Claude column as directional context rather than a head-to-head comparison.

Architecture in Brief

Skip this section if you read the Qwen3.5 breakdown.

The hybrid attention structure is identical to Qwen3.5, so I’ll keep this short — but enough to make sense of it without jumping to the other article.

The model has 40 layers, organized into 10 repeating groups of 4 layers each. Every group follows the same rhythm: three GatedDeltaNet blocks followed by one GatedAttention block. Both block types are paired with a Mixture-of-Experts (MoE) feedforward step, so a full group looks like this:

[DeltaNet + MoE] [DeltaNet + MoE] [DeltaNet + MoE] [FullAttention + MoE]

That 3:1 ratio is the whole story of the attention design.

The DeltaNet layers do the cheap work. Standard attention is expensive because every token compares itself to every other token in the sequence — that’s the famous O(n²) cost that makes long contexts prohibitively slow. DeltaNet sidesteps this by using linear attention: instead of comparing pairwise, it carries forward a compact “state” that summarizes the sequence so far, and updates that state selectively as each new token arrives. Compute scales as O(n), not O(n²). The sigmoid gating mechanism is what decides, per token, which parts of the state to update and which to preserve — think of it as a learned filter that says “keep this memory, overwrite that one.” This is what makes DeltaNet fast enough to handle 262K-token contexts on hardware you can afford.

The full-attention layer at the end of each group does the precise work. Linear attention is fast, but compressing the sequence into a running state costs something: fine-grained positional precision. On code and structured reasoning — where “this bracket closes that bracket 47 lines above” matters exactly — that loss hurts. The single GatedAttention block at the end of each group is a full O(n²) attention pass that re-anchors positional detail before the next group begins. Three cheap layers do the bulk of the work, one expensive layer keeps the model honest.
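The 3:1 rhythm reduces to a position check in code. A minimal sketch of the 40-layer schedule described above — the block names are labels for illustration, not real module classes:

```python
def block_type(layer_idx):
    """Schedule for 10 groups of 4 layers: three DeltaNet blocks, then full attention."""
    return "full_attention" if layer_idx % 4 == 3 else "deltanet"

schedule = [block_type(i) for i in range(40)]

# 30 linear-attention layers do the cheap O(n) work;
# 10 full O(n^2) layers re-anchor positional precision
assert schedule.count("deltanet") == 30
assert schedule.count("full_attention") == 10
assert schedule[:4] == ["deltanet", "deltanet", "deltanet", "full_attention"]
```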

Then comes the MoE part, and this is where the 35B-vs-3B number comes from.

Every feedforward step routes tokens through a pool of 256 expert sub-networks. For each token, a learned router picks 8 experts dynamically based on what the token looks like. On top of those 8, there’s 1 shared expert that processes every token regardless of routing — a kind of always-on general-purpose layer that handles things common to all inputs. So per token, 9 experts actually fire; the other 247 sit idle in memory.

This is why the model is described as “35B total, 3B active.” All 35 billion parameters are stored (and loaded on the GPU, or offloaded to CPU RAM), but only about 3 billion are touched by any given forward pass. You get the representational capacity of a large model with the inference cost of a small one — the core tradeoff that makes this architecture deployable on a single 24 GB GPU.
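The routing arithmetic behind "35B total, 3B active" can be sketched in a few lines. Random scores stand in for the learned router logits here; the function and constant names are illustrative, not from any real implementation:

```python
import random

NUM_EXPERTS, TOP_K = 256, 8  # routed pool size and per-token selection

def route_token(router_scores, top_k=TOP_K):
    """Pick the top-k experts by router score for this token."""
    ranked = sorted(range(len(router_scores)), key=router_scores.__getitem__, reverse=True)
    return ranked[:top_k]

random.seed(0)
scores = [random.random() for _ in range(NUM_EXPERTS)]  # stand-in for router logits
routed = route_token(scores)

# 8 routed experts + 1 always-on shared expert = 9 firing, 247 idle in memory
total_firing = len(routed) + 1
idle = NUM_EXPERTS - len(routed)
```

Every token gets a different `routed` set, which is why all 256 experts must stay resident even though only 9 ever fire for any single token.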

For the deeper mechanics — why the 3:1 hybrid ratio is specifically 3:1 rather than 2:1 or 5:1, what the sigmoid gate actually does to activations at the numerical level, and how the router learns to partition work across experts without collapsing to a few favorites — see The Architecture That Broke the Scaling Myth.

Context Window: 262K Native, 1M If You Actually Need It

Native context is 262,144 tokens. No configuration required beyond --max-model-len 262144 in your serving command. For most real workloads — even large document processing, multi-file code repositories, and long agent sessions — 262K is more than enough.

What YaRN Actually Does

Before the configuration, it helps to know what you’re turning on. Transformers track token positions using RoPE (Rotary Position Embedding) — each position gets assigned a unique rotational “angle,” and the model learns to interpret those angles to understand word order. The catch: the model only knows the angle range it was trained on. Feed it a sequence longer than 262K tokens and the position angles enter territory the model has never seen, and outputs collapse into nonsense.

YaRN (Yet another RoPE extensioN) solves this without retraining the model. Instead of teaching the model new angles, it rescales the existing ones — compressing them so that longer sequences fit inside the angle range the model already understands. A simple analogy: you have a ruler marked 0–262 cm, and you need to measure something 1000 cm long. You can either buy a longer ruler (retrain the model — expensive), or you can reinterpret each centimeter mark as representing 4 cm (apply YaRN with factor: 4.0). The ruler stays the same; you just read it differently.

The tradeoff is what makes this non-trivial. Rescaled angles aren’t the angles the model was trained on, so even within the original 262K range, every position looks slightly “unfamiliar” to the model. This produces a measurable quality degradation on short inputs — the model is now interpreting a 10K-token input as if the positions were spread across a stretched-out coordinate system it never fully learned. The larger the factor, the more aggressive the rescaling, and the more the short-input quality suffers.
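The ruler analogy maps directly onto a position rescale. A deliberately simplified sketch — real YaRN interpolates per frequency band rather than dividing every position uniformly, so this only captures the core idea:

```python
TRAINED_MAX = 262_144  # the position range the model actually learned

def rescale_position(pos, factor):
    """Uniform position interpolation: read each 'centimeter mark' as `factor` cm."""
    return pos / factor

# With factor 4.0, a ~1M-token position folds back inside the trained range...
assert rescale_position(1_010_000, 4.0) <= TRAINED_MAX

# ...but short inputs are squeezed too, which is the quality cost: token 10,000
# now sits where the model expects token 2,500
assert rescale_position(10_000, 4.0) == 2_500.0
```

The second assertion is the static-YaRN problem in miniature: the rescale applies whether or not the input needed it.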

The Static YaRN Problem

YaRN scaling extends context to approximately 1,010,000 tokens using a factor of 4.0. The official Qwen3-Next model card explicitly flags a caveat that carries over to the Qwen3.6 generation: all notable open-source serving frameworks implement static YaRN — the scaling factor applies uniformly regardless of the actual input length. The vLLM recipe documentation confirms the same serving behavior for Qwen3.5 and Qwen3.6 deployments.

If most of your inputs are short (< 32K tokens) but you’ve enabled YaRN for occasional long-document passes, you’ll take a quality degradation on the short inputs, which far outnumber the long ones. The math works against you: a small per-request quality loss multiplied across thousands of short requests usually outweighs the benefit of being able to occasionally handle a million-token document in the same server.

The recommendation from the model card: only modify rope_parameters when long-context processing is actually required, and tune factor to match your typical input length. factor: 2.0 extends context to roughly 524K tokens and applies a much milder rescaling penalty than factor: 4.0 — preferable for mixed workloads where you occasionally need long context but don't routinely hit the 1M mark.

YaRN with vLLM

VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve Qwen/Qwen3.6-35B-A3B \
  --hf-overrides '{
    "text_config": {
      "rope_parameters": {
        "rope_type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 262144
      }
    }
  }' \
  --max-model-len 1010000

The Two-Instance Pattern

For routing architectures that need to handle mixed-length workloads, the cleanest solution is to deploy two instances — one at native 262K with no YaRN, one with YaRN enabled — and route by estimated input length at inference time. Requests under ~200K go to the native instance at full quality; requests above that threshold go to the YaRN instance where the rescaling penalty is acceptable because the alternative is failure.

This sounds like extra infrastructure, but the static YaRN penalty on short inputs is real enough to warrant the split if throughput and response quality both matter. A single YaRN-enabled instance serving everything is the simpler deployment, but it’s the wrong choice for any workload whose distribution isn’t heavily weighted toward million-token inputs.
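The routing layer itself is small. A sketch of the length-based dispatch, assuming two locally served instances — the ports, the 200K threshold, and the 4-characters-per-token estimate are all illustrative choices, not prescriptions:

```python
NATIVE_URL = "http://localhost:8000/v1"   # 262K native instance, no YaRN
YARN_URL = "http://localhost:8001/v1"     # YaRN-enabled long-context instance
THRESHOLD_TOKENS = 200_000

def estimate_tokens(messages):
    """Rough estimate: ~4 characters per token for English/code (heuristic, not exact)."""
    return sum(len(str(m.get("content", ""))) for m in messages) // 4

def pick_endpoint(messages):
    """Short requests keep full native quality; only long ones pay the YaRN penalty."""
    return YARN_URL if estimate_tokens(messages) > THRESHOLD_TOKENS else NATIVE_URL

short = [{"role": "user", "content": "Fix this function."}]
long_doc = [{"role": "user", "content": "x" * 1_200_000}]  # ~300K tokens of document
```

Point your OpenAI client's `base_url` at whatever `pick_endpoint` returns; for borderline requests, a proper tokenizer count is worth the extra milliseconds.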

What Hardware Do You Actually Need

The official FP8 checkpoint is published at Qwen/Qwen3.6-35B-A3B-FP8. FP8 halves the memory footprint compared to BF16 with minimal quality loss — it's the cleanest option if you have two 24 GB GPUs and want near-full model fidelity.

For smaller hardware, you’ll want a GGUF quantization. GGUF is the unified single-file format used by the llama.cpp ecosystem. It packs model weights, tokenizer, and metadata into one binary, and — more importantly — it supports aggressive weight quantization schemes (4-bit, 5-bit, 8-bit variants) that trade small amounts of output quality for dramatic memory reductions. A 35B model that needs ~70 GB in BF16 can fit in under 20 GB as a 4-bit GGUF, at the cost of a few points on most benchmarks.

The Unsloth team publishes tested GGUF builds at unsloth/Qwen3.6-35B-A3B-GGUF using their Dynamic 2.0 quantization method — benchmarked as SOTA on mean KL divergence in 21 of 22 model sizes, meaning their quantized outputs track the original BF16 model more closely than alternatives. The practical tradeoffs across common quants:

  • Q4_K_M (~19 GB) — fits a single RTX 3090/4090, shows mild quality degradation (expect a couple of points drop on tight coding benchmarks)
  • Q5_K_M (~22 GB) — the sweet spot for quality-per-GB on 24 GB consumer GPUs; quality is within reach of FP8
  • Q8_0 (~37 GB) — near-lossless, but you’re better off running the official FP8 checkpoint at this size tier
  • UD-Q2_K_XL (~12 GB) — 2-bit Unsloth Dynamic quant; surprisingly usable for tool-calling workflows but not recommended for long reasoning chains
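The sizes in the list above track a simple back-of-envelope: parameter count times bits per weight. A rough sketch — the fractional bits-per-weight figures are approximations, since real GGUF files mix quantization types per tensor and carry metadata overhead:

```python
PARAMS = 35e9  # total stored parameters; the 3B *active* count doesn't change storage

def weights_gb(bits_per_weight):
    """Approximate weight storage in GB (ignores KV cache and runtime buffers)."""
    return PARAMS * bits_per_weight / 8 / 1e9

for name, bits in [("BF16", 16), ("FP8", 8), ("~Q4_K_M", 4.5), ("~2-bit", 2.7)]:
    print(f"{name}: ~{weights_gb(bits):.0f} GB")
# BF16 lands near 70 GB and a ~4.5-bit quant near 20 GB,
# consistent with the file sizes listed above
```

KV cache for long contexts comes on top of these numbers, which is why a ~19 GB quant on a 24 GB card still leaves you budgeting context length carefully.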

KTransformers takes a different path: CPU offloading. MoE-sparse models like Qwen3.6 benefit from this more than dense models do, because of how the architecture works — only 9 of 256 experts activate per token, so the other 247 are sitting idle in memory at any given moment. KTransformers exploits this by selectively pinning only the router and the shared expert to GPU memory, and keeping the 200+ idle experts in system RAM. When the router picks experts for a token, the needed weights are fetched from RAM to GPU on demand.

The tradeoff is throughput — you’ll see lower tokens/second than a full-GPU setup because of the RAM-to-VRAM transfer overhead. But the memory footprint drops to under 12 GB VRAM for a 35B model, which makes consumer desktops (one modest GPU + 32 GB RAM) a viable deployment option. If you’re running this locally for experimentation rather than production throughput, KTransformers is the path that unlocks the widest range of hardware.

Full Code Reference

Basic inference, thinking mode enabled:

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=[
        {"role": "user", "content": "Refactor this to handle null inputs:\n\ndef parse_config(data):\n    return data['key']['nested']"}
    ],
    max_tokens=81920,
    temperature=1.0,  # recommended for thinking mode - higher temp improves reasoning diversity
    top_p=0.95,
    extra_body={"top_k": 20, "presence_penalty": 1.5},
)

# Response includes the <think>...</think> block followed by the final answer
print(response.choices[0].message.content)

Thinking disabled (faster, for production instruct tasks):

response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=[{"role": "user", "content": "Extract all email addresses from this text: ..."}],
    max_tokens=32768,
    temperature=0.7,  # lower temp is correct here - instruct tasks benefit from determinism
    top_p=0.8,
    extra_body={
        "top_k": 20,
        "presence_penalty": 1.5,
        "chat_template_kwargs": {"enable_thinking": False},
    },
)

Image input (document understanding, diagram analysis):

response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": "https://your-host/architecture-diagram.png"},
            },
            {
                "type": "text",
                "text": "What design pattern does this diagram show? List every component and describe how data flows between them.",
            },
        ],
    }],
    max_tokens=32768,
    temperature=0.7,
    top_p=0.8,
)

Recommended sampling parameters by task, from the model card and reflected in the examples above:

  • Thinking mode: temperature 1.0, top_p 0.95, top_k 20
  • Instruct tasks (thinking disabled): temperature 0.7, top_p 0.8, top_k 20
  • Frontend coding: temperature 0.6

The coding-specific temperature reduction (0.6 vs 1.0) is deliberate: frontend code generation benefits from tighter token distributions because valid syntax spaces are narrow. Lower temperature reduces generation of valid-but-wrong-pattern code.

Counter-Arguments and What We Don’t Know

All benchmarks are self-reported by Alibaba. No independent third-party reproduction of the Qwen3.6–35B-A3B scores exists at publication date. QwenClawBench, QwenWebBench, and NL2Repo are internal evaluations — Alibaba controls both the benchmark construction and the evaluation. Terminal-Bench 2.0 and SWE-bench are external, but methodology choices (scaffold, temperature, retries) affect results enough that cross-model comparisons require matching conditions to be meaningful.

No training disclosure. Neither the pre-training corpus, training compute, training cutoff date, nor post-training alignment methodology (RLHF, DPO, or otherwise) are documented. The model card and blog post discuss capabilities, not how they were achieved.

No paper. The design rationale for specific changes from 3.5 — including what prompted the thinking preservation feature and what training changes drove the Terminal-Bench jump — cannot be scrutinized. The +11 point Terminal-Bench gain is either a genuine improvement or reflects benchmark overfitting; without training details, distinguishing the two is impossible.

preserve_thinking has a context cost. Verbose reasoning traces accumulate rapidly. A 20-turn agent session with preserve_thinking enabled can exhaust 80–100K tokens of context budget in reasoning alone before accounting for code or tool outputs. This is not a reason to avoid the feature — it's a reason to build context budget management into your agent loop.

Static YaRN is a real limitation. The model card flags it explicitly. Mixed-workload deployments that enable YaRN globally for occasional long documents will see quality degradation on short inputs. Route by context length if this matters to you.

The 3.5→3.6 delta is concentrated, not general. MMLU-Pro did not improve. Standard knowledge retrieval and general-purpose chat quality did not change meaningfully. Qwen3.6 is a better agent and a better reasoner; it is not a better general-purpose assistant than 3.5.

What This Means for You

Three practical takeaways for engineers deploying this today:

First, benchmark preserve_thinking on your specific agent tasks. The feature is architecturally motivated — not a prompt trick — and the qualitative difference on iterative debugging tasks is measurable. For multi-turn coding agents, this is the most impactful feature in the release. Enable it for sessions where the model needs to track constraints across turns; disable it for isolated-turn inference to save context budget.

Second, use SGLang with MTP speculative decoding if throughput is a constraint. Reasoning models spend a large fraction of their inference budget on <think> blocks. MTP speculative decoding recovers throughput at no quality cost precisely in the output patterns (structured code, JSON, constrained text) where speculation accuracy is highest. The self-drafted MTP approach avoids the distribution-mismatch problem that affects external draft models.

Third, the Apache 2.0 license makes this a serious option for production agentic workloads. SWE-bench Verified at 73.4% (self-reported) running on two consumer GPUs under an unrestricted commercial license is a different value proposition than cloud-hosted proprietary models. The gap to the current frontier — Claude Opus 4.7 at 87.6%, GPT-5.3-Codex at 85.0% — is real but not prohibitive for most production use cases, especially when you factor in local inference, no per-token billing, and full control over the deployment environment.

Your AI Agent Is Goldfish-Brained. Qwen3.6–35B-A3B Is the Fix. was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
