Procedural-skill SFT across capacity tiers: A W-Shaped pre-SFT Trajectory and Regime-Asymmetric Mechanism on 0.8B-4B Qwen3.5 Models
arXiv:2605.11907v2 Announce Type: replace
Abstract: We measure procedural-skill SFT contribution across three Qwen3.5 dense scales (0.8B, 2B, 4B) on a 200-task / 40-skill holdout, with Claude Haiku 4.5 as a frontier reference. The corpus is 353 rows of (task + procedural-skill block, Opus chain-of-thought, judge-pass) demonstrations.
Main finding. Under matched-path LLM-only scoring, the SFT-attributable procedural-$\Delta$ lift is roughly uniform across sizes: $+0.070 / +0.040 / +0.075$ at 0.8B / 2B / 4B. Variation in post-SFT $\Delta$ ($-0.005$, $+0.100$, $+0.065$) is dominated by a W-shaped pre-SFT base trajectory ($-0.075$, $+0.060$, $-0.010$, Haiku-4-5 at $+0.030$): the 5-step procedure hurts 0.8B and 4B, helps 2B, and helps frontier Haiku modestly. SFT works hardest in absolute terms where the base struggles with the procedure -- a regime-asymmetric pattern with a falsifiable prediction at 8B/14B.
Methodology. (i) A bench format-compliance artifact: 83.5% of the holdout uses a deterministic ANSWER-line extractor that under-counts free-form-prose conclusions; our LLM-only re-judge reveals it was systematically biased against the curated condition. (ii) A negative-iteration sequence at 0.8B: three well-formed recipe variants cluster post-SFT curated pass-rate within a 2 pp band, constraining the absolute-pass-rate ceiling to base capacity rather than recipe.
Cross-family judge validation. GPT-5.4 via OpenRouter on all 7 configurations (2800 paired episodes) agrees on the direction of every per-student finding: Cohen's $\kappa \geq 0.754$, agreement $\geq 93.25\%$, max headline $\Delta$ shift $\leq 0.035$ pp.
Two earlier framings -- "format-only learning at 0.8B" and "SFT contribution shrinks at 4B" -- were path-mismatch artifacts; this paper supersedes both. Single-seed evaluation; threats itemised in the paper.