Agents can spend a lot of context on raw output from pytest, grep, git log, kubectl, pip install, file reads, stack traces, etc., even though usually only a small block of it is relevant.
We've built a benchmark for task-conditioned tool-output pruning and fine-tuned Qwen 3.5 2B on it with Unsloth. The benchmark combines tool outputs from the SWE-bench dataset with synthetic examples.
Results on the held-out set:
- 86% recall
- 92% compression
- Beats other pruners and zero-shot models (+11 recall points over zero-shot Qwen 3.5 35B A3B)
We released squeez as a CLI: you can put it in front of tool output before the next reasoning step, or add it to something like CLAUDE.md as a lightweight preprocessing step. You can serve squeez with any inference framework, e.g. vLLM.
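To make the "put it in front of tool output" idea concrete, here is a minimal sketch of the wrapper pattern. The function name and prompt format are my own illustrative assumptions, not squeez's actual interface; the `pruner` callable stands in for whatever client you use (e.g. against a vLLM OpenAI-compatible endpoint).

```python
def prune_tool_output(task: str, raw_output: str, pruner) -> str:
    """Ask a pruning model to keep only the task-relevant lines.

    `pruner` is any callable mapping a prompt string to a completion.
    The prompt below is an illustrative assumption, not squeez's
    actual prompt format.
    """
    prompt = (
        f"Task: {task}\n"
        f"Tool output:\n{raw_output}\n"
        "Return only the lines relevant to the task."
    )
    return pruner(prompt)

# Usage with a stub pruner that keeps lines mentioning failures,
# standing in for the real model:
stub = lambda p: "\n".join(l for l in p.splitlines() if "FAILED" in l)
print(prune_tool_output(
    "why did test_login fail?",
    "PASSED test_a\nFAILED test_login - AssertionError",
    stub,
))
# → FAILED test_login - AssertionError
```

The agent then sees only the pruned text as its observation for the next reasoning step, instead of the full raw dump.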
Everything is open source; check out the links below for details:
- paper: https://arxiv.org/abs/2604.04979
- model: https://huggingface.co/KRLabsOrg/squeez-2b
- dataset: https://huggingface.co/datasets/KRLabsOrg/tool-output-extraction-swebench
- code: https://github.com/KRLabsOrg/squeez
If you're interested, I can also post some examples / eval outputs.