Why MoEs below 10B active parameters feel like gambling

We've seen lots of MoEs coming out recently. While they do phenomenal work at speed, you pay the price in coherence, unless the MoE has at least 10B active parameters per token.
I code with these models often and have been trying many different ones. The most recent I've tried are:
qwen3-coder-next, qwen3.5-35b, qwen3.6-35b
and none of them come close to the level of stability I saw in qwen3.5-27b, not even qwen3.6-35b-A3b.

While the A3B MoE can solve the problem, it often needs hand-holding and multi-turn steering. The A3B will often try to use tools available in the coding harness that don't apply to the problem it's trying to fix, so I often have to manually disable some tools to keep it focused (a rough sketch of that workaround is below), whereas the 27B would intuitively ignore the irrelevant tools on its own. This is just one example, but what the model chooses to do next varies far more with the 35B-A3B than with the 27B dense model. I would like to use the MoE, but I'm struggling to find a place for it in my agentic workflow.
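
For what it's worth, the tool-disabling workaround can be as simple as whitelisting the tool schemas before they ever reach the model. The sketch below assumes an OpenAI-compatible local server; the endpoint, tool names, and served model name are all hypothetical placeholders, not any particular harness's API.

```python
# Minimal sketch: expose only task-relevant tools to a low-active-parameter MoE,
# instead of the harness's full tool set. Endpoint, tool schemas, and model name
# are assumptions for illustration.
from openai import OpenAI  # works with any OpenAI-compatible local server (llama.cpp, vLLM, ...)

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Full tool set the coding harness would normally expose (hypothetical schemas).
ALL_TOOLS = [
    {"type": "function", "function": {
        "name": "read_file",
        "description": "Read a file from the workspace",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"}},
                       "required": ["path"]}}},
    {"type": "function", "function": {
        "name": "run_tests",
        "description": "Run the project's test suite",
        "parameters": {"type": "object", "properties": {}}}},
    {"type": "function", "function": {
        "name": "web_search",
        "description": "Search the web",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]}}},
]

# For the A3B model, keep only what the current task actually needs;
# a dense model can usually be trusted to ignore the irrelevant ones itself.
RELEVANT = {"read_file", "run_tests"}
tools = [t for t in ALL_TOOLS if t["function"]["name"] in RELEVANT]

resp = client.chat.completions.create(
    model="qwen3.6-35b-a3b",  # assumption: the MoE served locally under this name
    messages=[{"role": "user", "content": "Fix the failing test in utils.py"}],
    tools=tools,
)
print(resp.choices[0].message)
```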

Edit: English is hard, but you get what I'm saying? At least I'll leave the typos as proof this isn't a bot account. LOL

