GLM is the most inconsistent model so far on plan-mode. It both under-clarifies on clear ambiguity (missed 4 audit prompts) AND over-clarifies on degenerate inputs (whitespace, single char) AND over-clarifies on multi-turn answers (reclarify_partial_answer — the user provided answers, the model asked again). Failing in both directions in the same run suggests GLM doesn't have a stable internal sense of "is this ambiguous or not" — it just has a "should I ask?" coin flip whose bias varies with input shape.
Anyway, here are the benchmark results across all suites:
| Model | plan_mode | plan_mode_stress | tool_calling | file_generation | Combined |
|---|---|---|---|---|---|
| qwen/qwen3-coder-next (Q8) | 12/13 (92%) | 32/38 (84%) | 18/20 (90%) | 4/6 (67%) | 66/77 (86%) |
| google/gemma-4-26b-a4b | 11/13 (85%) | 30/38 (79%) | 17/20 (85%) | 4/6 (67%) | 62/77 (81%) |
| zai-org/glm-4.7-flash | 9/13 (69%) | 27/38 (71%) | 18/20 (90%) | 4/6 (67%) | 58/77 (75%) |
| qwen/qwen3-next-instruct-80b (Q6) | 12/13 (92%) | 28/38 (74%) | 15/20 (75%) | 2/6 (33%) | 57/77 (74%) |
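The Combined column above appears to be the sum of per-suite passes over the sum of per-suite totals, rounded to the nearest percent. A minimal sketch checking that reading (scores copied from the table; the aggregation method is my assumption):

```python
# Per-suite (passed, total) pairs in table order:
# plan_mode, plan_mode_stress, tool_calling, file_generation.
suites = {
    "qwen/qwen3-coder-next (Q8)": [(12, 13), (32, 38), (18, 20), (4, 6)],
    "google/gemma-4-26b-a4b": [(11, 13), (30, 38), (17, 20), (4, 6)],
    "zai-org/glm-4.7-flash": [(9, 13), (27, 38), (18, 20), (4, 6)],
    "qwen/qwen3-next-instruct-80b (Q6)": [(12, 13), (28, 38), (15, 20), (2, 6)],
}

for model, scores in suites.items():
    passed = sum(p for p, _ in scores)   # total tests passed across suites
    total = sum(t for _, t in scores)    # total tests run across suites
    print(f"{model}: {passed}/{total} ({round(100 * passed / total)}%)")
```

Running it reproduces the Combined column (66/77 at 86%, 62/77 at 81%, 58/77 at 75%, 57/77 at 74%), so the column is a micro-average over individual tests rather than an average of the four suite percentages.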