Qwen3-1.7B fine-tuned on synthetic data outperforms GLM-5 (744B) on multi-turn tool-calling: 437x smaller, trained from noisy production traces
TL;DR: We fine-tuned Qwen3-1.7B on synthetic data generated from noisy production traces. It scores 0.853 on average across 5 corruption scenarios on SGD tool-calling, beating GLM-5 (744B) at 0.835. Training directly on the noisy traces themselves drops scores by 14-28pp. A…