Abliterated version of the new Qwen3.6-35B-A3B up on HF

Pushed an abliterated Qwen3.6-35B-A3B to HF. Worth noting because MoE abliteration is genuinely different from dense — the refusal signal lives in the expert path, not attention, so standard Q/K/V LoRA doesn’t cut it.

Approach (Abliterix framework):

LoRA rank-1 on O-proj + MLP down-proj (Q/K/V disabled on purpose)
Expert-Granular Abliteration: project refusal direction across all 256 expert down_proj slices per layer
MoE router suppression: identified top-10 “safety experts”, router bias -2.10
Orthogonalized steering vectors + Gaussian decay across layers
Strength search in [0.5, 6.0] to avoid degenerate output

Eval: 7/100 refusals, KL 0.0189 from base. Baseline is 100/100. Judge is Gemini 3 Flash — degenerate/garbled output counts as refusal, no keyword matching, 150-token generations.

One thing worth saying since this comes up a lot: a bunch of abliterated model cards claim 0–3/100 refusals, and most are using 30–50 token generations + keyword detection. That undercounts delayed/soft refusals and lets garbled output pass as “compliant.” 7/100 is what a stricter LLM-judge eval actually gives you. Take the flashy numbers with salt.

huggingface/wangzhang/Qwen3.6-35B-A3B-abliterated

Research only. Safety guardrails removed — use responsibly.

submitted by /u/Free_Change5638
[link] [comments]

Leave a Comment