Qwen3-1.7B fine-tuned on synthetic data outperforms GLM-5 (744B) on multi-turn tool-calling: 437x smaller, trained from noisy production traces

TL;DR: We fine-tuned Qwen3-1.7B on synthetic data generated from noisy production traces. It scores 0.853 avg across 5 corruption scenarios on SGD tool-calling, beating GLM-5 (744B) at 0.835. Training directly on the same traces drops scores by 14-28pp. All code is open-source.


What we found

We're part of the Distil Labs team and we've been seeing the same pattern with customers: training directly on clean, human-annotated data works great. But real production traces are noisy: wrong tool calls, renamed APIs, mixed-in data from other services. Training directly on them produces models that confidently repeat those mistakes.

So we benchmarked it. The key finding: production traces are not reliable training data unless you use them as context for synthetic data generation.

Key specs:

  • Student model: Qwen3-1.7B, LoRA rank 64, 4 epochs, lr 5e-5
  • Task: Restaurant booking agent with tools FindRestaurants, ReserveRestaurant, respond_to_user
  • Dataset: Schema-Guided Dialogue (SGD) from Google Research
  • Eval: LLM-as-a-judge (GPT-OSS-120B), ~360 turn pairs from 34 held-out conversations
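For concreteness, the hyperparameters above map onto a standard Hugging Face peft/transformers setup roughly like this. This is a minimal sketch, not the authors' actual training script; the LoRA alpha and output path are our assumptions.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA adapter config matching the specs above (lora_alpha is an assumption).
lora = LoraConfig(r=64, lora_alpha=128, task_type="CAUSAL_LM")

# Trainer hyperparameters from the post; everything else left at library defaults.
args = TrainingArguments(
    output_dir="qwen3-1.7b-tool-calling",  # hypothetical path
    num_train_epochs=4,
    learning_rate=5e-5,
)
```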

Results

Tested across 5 scenarios that simulate real-world trace corruption:

| Scenario | Synthetic | Direct | Delta |
|---|---|---|---|
| Curated data (clean) | 0.866 | 0.864 | +0.2pp |
| Noisy labels (50% corrupted) | 0.844 | 0.721 | +12.3pp |
| Schema drift (renamed functions) | 0.844 | 0.585 | +25.9pp |
| Low data (5 traces only) | 0.852 | 0.649 | +20.3pp |
| Irrelevant trace mixing (80% wrong domain) | 0.858 | 0.694 | +16.4pp |

Best frontier teacher: GLM-5 (744B) = 0.835. The 1.7B student beats it in every scenario.

When the data is clean and human-annotated, both pipelines tie — direct training works fine. But the moment you use real production traces with their inevitable noise, direct training collapses. The synthetic pipeline stays within 2pp of the clean-data ceiling every time.

How the synthetic pipeline works

Instead of training directly on traces, we use them as context for a teacher LLM to generate clean training data:

  1. Production traces go in as unstructured context — not as training labels
  2. Teacher LLM (GLM-5) reads task description + tool schema + trace samples
  3. Generates ~2,000 clean multi-turn conversations (~45k turns)
  4. Validation layer checks schema conformance, removes duplicates/outliers
  5. Fine-tune student on the validated dataset
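The steps above can be sketched in a few lines. The teacher call (steps 2-3) is stubbed out since it depends on your inference setup; the validation layer (step 4) is the part worth showing. All function and field names here are our assumptions, not the actual open-source implementation.

```python
import json

# Current tool schema (step 2): the source of truth for "correct".
TOOL_SCHEMA = {"FindRestaurants", "ReserveRestaurant", "respond_to_user"}

def generate_with_teacher(task_description, schema, trace_samples, n=2000):
    """Steps 2-3: prompt the teacher LLM (GLM-5) with the traces as
    *context*, never as labels. Stubbed; a real version calls an API."""
    raise NotImplementedError

def validate(conversations, schema=TOOL_SCHEMA):
    """Step 4: keep only conversations whose tool calls conform to the
    current schema, and drop exact duplicates."""
    seen, clean = set(), []
    for conv in conversations:
        key = json.dumps(conv, sort_keys=True)
        if key in seen:
            continue  # duplicate
        calls = [t["name"] for t in conv if t.get("role") == "tool_call"]
        if all(name in schema for name in calls):
            seen.add(key)
            clean.append(conv)
    return clean
```

The student is then fine-tuned only on what survives validation (step 5), so schema-violating teacher outputs never become training labels.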

The traces tell the teacher what the domain looks like. The schema tells it what correct looks like. That separation is the whole trick — the teacher extracts the useful domain signal from traces while being guided by the schema for what correct behavior should be.

Why this matters for running small models

If you're deploying a small model for tool-calling, you don't need thousands of clean labeled examples. You need a handful of production traces (even noisy ones) plus a correct tool schema, and the synthetic pipeline handles the rest.

The schema drift result is particularly relevant if you iterate on APIs — just renaming functions between versions makes all your old traces useless for direct training. The synthetic pipeline reads the current schema, not the trace vocabulary.
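A toy illustration of that failure mode, under assumed v1/v2 function names: every tool call in a pre-rename trace fails validation against the current schema, so none of the old traces are usable as direct training labels.

```python
# Current (v2) schema after a rename; names are hypothetical.
V2_SCHEMA = {"FindRestaurants", "ReserveRestaurant"}

def usable_for_direct_training(trace_calls, schema=V2_SCHEMA):
    """A trace is only a valid label source if all its calls still exist."""
    return all(name in schema for name in trace_calls)

old_trace = ["SearchRestaurants", "BookRestaurant"]  # hypothetical v1 names
assert not usable_for_direct_training(old_trace)
assert usable_for_direct_training(["FindRestaurants"])
```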

Limitations

  • Single task domain (restaurant booking) — we haven't tested across radically different domains yet
  • LLM-as-a-judge eval rather than human evaluation
  • Only tested with Qwen3-1.7B as student — unclear how results scale to larger/smaller students

Curious if anyone has tried fine-tuning tool-calling models on production data — what was your experience with data quality?

submitted by /u/party-horse
