I wanted to know if the prompting advice you see everywhere ("be specific", "add examples", "use XML tags") actually works on small local models. So I ran 572 calls across 8 models: 6 local (on an M2 96GB and an RTX 5070 Ti via Ollama) and 2 frontier APIs (GPT-4.1-mini and Claude Haiku 4.5) for cross-validation. Total API cost was $0.03.
Three findings that changed how I prompt local models.
First, too much detail hurts small models. I tested the same task content at four levels of structural complexity: from minimal ("implement fizzbuzz") up to maximal (role + constraints + examples + edge cases). The 1.5B model went from 78% pass rate at minimal to 28% at maximal. That's a 50-point drop (64% relative) purely from adding more detail. The 1B model dropped 11%. Models at 3.8B and above were completely unaffected, holding 94% across all complexity levels. The sweet spot for every model size was "role + constraints". No examples, no edge case lists. Adding more beyond that actively degrades output on anything under 3B.
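To make the levels concrete, here's a sketch of how the four tiers compose. This is not my actual harness; the role/constraint/example strings are made-up placeholders, and only the wrapping structure grows while the task stays fixed:

```python
def build_prompt(task: str, level: str) -> str:
    """Wrap the same task content in increasing amounts of structure."""
    # Placeholder scaffolding strings; the real runs used task-specific text.
    role = "You are a careful Python programmer."
    constraints = "Return only code, no explanations."
    examples = "Example: for n=3, output lines 1, 2, Fizz."
    edge_cases = "Edge cases: handle n=0 and negative n by printing nothing."

    if level == "minimal":
        return task
    if level == "role":
        return f"{role}\n{task}"
    if level == "role+constraints":  # the sweet spot at every model size
        return f"{role}\n{constraints}\n{task}"
    if level == "maximal":  # this tier is what tanked the sub-3B models
        return f"{role}\n{constraints}\n{examples}\n{edge_cases}\n{task}"
    raise ValueError(f"unknown level: {level}")
```

The point is that the diff between tiers is pure scaffolding: same task string at every level, so any pass-rate delta is attributable to structure alone.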
Second, filler words are load-bearing for small models. I tested removing natural language filler ("basically", "I think") and simplifying phrases ("in order to" → "to") across model sizes. On qwen-coder 1.5B the pass rate dropped from 0.89 to 0.28. I pinpointed it to two specific operations: phrase simplification ("in order to" → "to") and filler deletion ("basically", "I think"). Each independently killed small model output. But character normalization and structural cleanup were safe across all sizes. The working theory is that sub-2B models use discourse markers as processing scaffolding. Remove the scaffolding and the output collapses. On API models the same simplification either helped or had zero effect. This is specifically a small model problem.
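For anyone who wants to split their own prompt cleanup the same way, here's a hypothetical reconstruction of the four operations as separate functions (the regexes are my illustration, not the exact ones I ran). The first two are the ones that broke sub-2B models; the last two were safe at every size:

```python
import re
import unicodedata

def simplify_phrases(text: str) -> str:
    """DANGEROUS on sub-2B models: 'in order to' -> 'to'."""
    return text.replace("in order to", "to")

def delete_filler(text: str) -> str:
    """DANGEROUS on sub-2B models: strip 'basically', 'I think'."""
    return re.sub(r"\b(basically|I think),?\s*", "", text)

def normalize_chars(text: str) -> str:
    """Safe everywhere: unicode normalization (smart quotes etc.)."""
    return unicodedata.normalize("NFKC", text)

def cleanup_structure(text: str) -> str:
    """Safe everywhere: collapse blank-line runs, trim whitespace."""
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```

Keeping them as separate passes is what let me attribute the damage: run the pipeline with one operation disabled at a time and compare pass rates.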
Third, format preference is a myth. Everyone says use XML for Claude, Markdown for GPT. I tested XML vs Markdown vs plain text across 4 local models: qwen-coder 1.5B, gemma 1B, gemma 4B, phi4 3.8B. 96 calls, 3 formats, 8 tasks each. XML 0.80, Markdown 0.80, plain 0.83. No model showed a significant format preference. Two independent studies found the same: the format sensitivity paper (arXiv 2411.10541) tested GPT-4 and saw 0-7pp deltas, none significant, and Systima.ai ran 600 calls and got identical scores (XML 98.4%, Markdown 98.4%). Anthropic recommends XML in their docs but cites zero quantitative evidence for it.
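"Same task, three formats" looks roughly like this (a simplified sketch, not my exact templates; the section names are illustrative):

```python
def wrap_prompt(task: str, fmt: str) -> str:
    """Render identical task content in each of the three formats tested."""
    if fmt == "xml":
        return f"<task>\n{task}\n</task>\n<output>\ncode only\n</output>"
    if fmt == "markdown":
        return f"## Task\n{task}\n\n## Output\ncode only"
    if fmt == "plain":
        return f"Task: {task}\nOutput: code only"
    raise ValueError(f"unknown format: {fmt}")
```

Only the delimiters differ; the information content is byte-for-byte the same, which is what makes the 0.80 / 0.80 / 0.83 result interpretable as pure format effect.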
The practical takeaway for anyone running models under 3B locally: the prompting playbook is different from what works on frontier models. Keep prompts at role + constraints level. Don't strip filler words. Don't load up on examples and edge cases. The advice in prompt engineering guides is calibrated for GPT-4 and Claude, and some of it actively hurts small models.
One methodology lesson that almost led me to a wrong conclusion: never trust k=1 results on boundary models. A model I tested at k=1 showed "simplifying filler words hurts by 67%." At k=3 the same experiment showed "it helps by 26%." Completely opposite conclusion. Models in the 50-80% pass rate range are coin flips on single runs. If you're benchmarking local models from single-shot results on tasks near the capability edge, you're probably seeing noise.
Curious whether other people running local models have noticed prompt sensitivity differences compared to API models. My data is all coding tasks so I don't know if this generalizes to other workloads, but my gut says the small model prompting playbook is fundamentally different.