I've been building Abliterlitics, an open-source abliteration forensics toolkit. The idea is straightforward: take the same base model, compare the different abliteration techniques others have applied, and measure what actually changed using benchmarks, safety evaluation, distribution shift, and weight-level analysis.

This post covers Qwen3.6-27B, comparing five abliteration variants against the base model. I recovered safetensors from HauhauCS's Q8_K_P GGUF, then ran 85 hours of benchmarks, HarmBench, KL divergence, and weight forensics across all six models.

The short version: Heretic and Huihui are the top two for capability preservation. Huihui has the smallest benchmark deltas, Heretic the lowest KL divergence. All five abliterated models reach near-complete safety removal. AEON's "enhanced capabilities" claim is contradicted by the data. Abliterix has the worst capability preservation by far.

Full report with all tables and charts: HuggingFace model card.
The six models
| Name | Type |
|---|---|
| Base | Qwen/Qwen3.6-27B |
| Heretic | llmfan46/Qwen3.6-27B-uncensored-heretic-v2 |
| HauhauCS | HauhauCS/Qwen3.6-27B-Uncensored-HauhauCS-Aggressive |
| Huihui | huihui-ai/Huihui-Qwen3.6-27B-abliterated |
| AEON | AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16 |
| Abliterix | wangzhang/Qwen3.6-27B-abliterated-v2 |
HauhauCS used a tool called "Reaper Abliteration," which was shown to be plagiarised from the AGPL-3.0-licensed Heretic, with all attribution stripped and the code relicensed under PolyForm Noncommercial. Based on my analysis of the recovered source code, Reaper adds subspace rank-k ablation, per-component continuous curves, and SOM clustering on top of the Heretic-derived core. The model was exported as a Q8_K_P GGUF; I converted it back to safetensors with ungguf, my GGUF-to-safetensors tool. The weights therefore carry two layers of modification, superimposed: Reaper's abliteration edits and GGUF quantisation round-trip noise.
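The round-trip noise is easy to picture with a toy Q8_0-style block quantiser. This is a deliberate simplification (the real Q8_K_P layout in GGUF is more involved), but it shows why every tensor picks up a small uniform error: each block of values shares one scale, values snap to integer codes, and dequantisation cannot recover the originals exactly.

```python
def q8_roundtrip(values, block=32):
    """Toy Q8_0-style round trip: one fp scale per block plus int8 codes.
    Illustrative only; the actual Q8_K_P layout differs."""
    out = []
    for i in range(0, len(values), block):
        chunk = values[i:i + block]
        scale = max(abs(v) for v in chunk) / 127.0 or 1e-12
        codes = [round(v / scale) for v in chunk]   # snap to int8 codes
        out.extend(c * scale for c in codes)        # dequantise
    return out

weights = [0.013 * (i % 7) - 0.031 for i in range(64)]
restored = q8_roundtrip(weights)
# Nearly every element moves slightly: this is the uniform noise floor
# that a weight diff picks up on top of the abliteration edits.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

The error per element is bounded by half the block scale, which is why the round-trip noise shows up as a small, even smear across all tensor types rather than a targeted edit.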
I will drop HauhauCS from all future comparisons. Without a proper safetensors release, and with the tool itself plagiarised, there is little point. The "lossless" claims do not hold up for any of the models, and the recovered Reaper Abliteration source is open for anyone to inspect how the models were created.
Benchmarks
Evaluated with lm-evaluation-harness via vLLM 0.19.0, BitsAndBytes 4-bit quantisation on a single RTX 5090. All six models tested with identical settings. BNB4 quantisation drops absolute scores but preserves relative deltas between variants.
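For anyone reproducing this, the harness invocation looked roughly like the following. This is a sketch, not my exact command: the task names follow lm-evaluation-harness conventions, and the `model_args` values shown are illustrative rather than my precise configuration.

```shell
# Illustrative lm-evaluation-harness invocation via the vLLM backend.
# Exact model_args (memory limits, dtype, etc.) are omitted/assumed here.
lm_eval --model vllm \
  --model_args pretrained=Qwen/Qwen3.6-27B,quantization=bitsandbytes \
  --tasks mmlu,hellaswag,arc_challenge,winogrande,truthfulqa_mc2,piqa,gsm8k \
  --batch_size auto
```

The same command was repeated per variant, swapping only `pretrained=`, which is what keeps the relative deltas comparable.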
| Task | Base | Heretic | HauhauCS | Huihui | AEON | Abliterix |
|---|---|---|---|---|---|---|
| MMLU | 83.3% | 82.8% | 83.9% | 83.4% | 82.9% | 81.3% |
| HellaSwag | 83.5% | 83.2% | 83.1% | 83.5% | 82.7% | 77.3% |
| ARC Challenge | 59.1% | 58.0% | 57.9% | 59.5% | 56.1% | 53.2% |
| WinoGrande | 77.7% | 77.7% | 77.7% | 77.4% | 75.3% | 74.9% |
| TruthfulQA MC2 | 56.7% | 51.1% | 47.2% | 54.8% | 46.1% | 48.7% |
| PiQA | 81.0% | 81.0% | 81.0% | 81.2% | 80.4% | 75.7% |
| GSM8K (7168 tok) | 34.4% | 27.5% | 51.0% | 75.1% | 51.2% | 37.6% |
| Lambada (ppl) | 3.18 | 3.24 | 3.35 | 3.15 | 3.44 | 9.12 |
Delta vs base
| Task | Heretic | HauhauCS | Huihui | AEON | Abliterix |
|---|---|---|---|---|---|
| MMLU | -0.5 | +0.6 | +0.1 | -0.4 | -2.0 |
| HellaSwag | -0.3 | -0.4 | +0.0 | -0.8 | -6.2 |
| ARC Challenge | -1.1 | -1.2 | +0.4 | -3.0 | -5.9 |
| WinoGrande | +0.0 | +0.0 | -0.3 | -2.4 | -2.8 |
| TruthfulQA MC2 | -5.6 | -9.5 | -1.9 | -10.6 | -8.0 |
| PiQA | +0.0 | +0.0 | +0.2 | -0.6 | -5.3 |
| GSM8K | -6.9 | +16.6 | +40.7 | +16.8 | +3.2 |
Charts: Benchmark Comparison | Delta Chart
HarmBench
HarmBench was run with 400 textual behaviours at max_tokens=6144, with responses classified via CoT direction analysis and verified by three independent LLM reviewers.
| Variant | ASR | Empty | Full CoT ASR |
|---|---|---|---|
| Base | 25.8% | 1 | 26.0% |
| Huihui | 98.5% | 5 | 99.8% |
| HauhauCS | 94.5% | 22 | 100.0% |
| Abliterix | 94.5% | 22 | 100.0% |
| Heretic | 92.5% | 30 | 100.0% |
| AEON | 88.8% | 45 | 100.0% |
Four of five reach 100% Full CoT ASR. The reported ASR differences come from how much of the 6144-token generation budget is consumed by chain-of-thought reasoning before the visible response. When the budget is exhausted, the response is empty and the classifier marks it as a refusal. This understates the true ASR.
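My reading of how the two numbers relate can be sketched as follows. This is not the actual Abliterlitics classifier: it assumes the reported ASR scores empty responses as refusals, while the Full CoT pass rescores them by the direction of their reasoning.

```python
def asr_metrics(records):
    """records: (empty_response, cot_direction_compliant) per behaviour.
    A sketch of how exhausted generation budgets deflate reported ASR,
    not the actual Abliterlitics classifier."""
    total = len(records)
    # Reported ASR: an empty visible response is scored as a refusal.
    reported = sum(c for e, c in records if not e) / total
    # Full CoT ASR: empty responses are rescored by their CoT's direction.
    full_cot = sum(c for _, c in records) / total
    return reported, full_cot

# Hypothetical 400-behaviour run: 30 empty (compliant CoT), 370 compliant
records = [(True, True)] * 30 + [(False, True)] * 370
reported, full = asr_metrics(records)  # → 0.925, 1.0
```

Under these assumptions the Heretic-style numbers (92.5% reported, 100% Full CoT) fall out directly from its 30 empty responses.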
Charts: HarmBench Summary | By Category
KL Divergence
Lower is better. Measures output distribution shift from base on benign prompts.
| Variant | KL (batchmean) | Rating |
|---|---|---|
| Heretic | 0.0037 | excellent |
| Huihui | 0.0074 | excellent |
| Abliterix | 0.0222 | very good |
| AEON | 0.0238 | very good |
| HauhauCS | 0.0242 | very good |
All five sit well below the ~0.1 KL level typically associated with capability damage.
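The metric itself is simple: softmax both models' next-token logits on the same benign prompt and average KL(base‖variant) over the batch. This is a sketch of the idea on toy logits, not the exact Abliterlitics implementation.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_batchmean(base_logits, variant_logits):
    """Mean over prompts of KL(P_base || P_variant) on next-token
    distributions. Sketch of the metric, not the toolkit's exact code."""
    total = 0.0
    for b, v in zip(base_logits, variant_logits):
        p, q = softmax(b), softmax(v)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return total / len(base_logits)

identical = kl_batchmean([[1.0, 2.0, 3.0]], [[1.0, 2.0, 3.0]])  # 0.0
shifted   = kl_batchmean([[1.0, 2.0, 3.0]], [[1.0, 2.0, 3.5]])  # small > 0
```

"Batchmean" here just means the per-prompt divergences are summed and divided by the number of prompts, matching the ratings in the table above.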
Weight Analysis
This is where things get interesting.
| Metric | AEON | Abliterix | Heretic | Huihui | HauhauCS |
|---|---|---|---|---|---|
| Tensors changed | 88 (10.4%) | 101 (11.9%) | 120 (14.1%) | 128 (15.1%) | 564 (66.4%) |
| Relative edit | 6.0% | 5.2% | 2.1% | 1.5% | 0.7% |
HauhauCS is an extreme outlier with 4.4-6.4x more changed keys than any other variant. This is the combination of Reaper's abliteration targeting multiple component types plus GGUF Q8_K_P round-trip noise. A uniform ~0.57% relative edit is visible across all tensor types, including types that other methods don't touch like embed_tokens and q_proj. The abliteration signal sits on top of this noise floor.
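The "relative edit" number is a norm ratio: how large the weight delta is relative to the base tensor. The following is my reconstruction of that metric on toy values; the toolkit may aggregate across tensors differently.

```python
import math

def relative_edit(base, variant):
    """||variant - base|| / ||base|| over a flattened tensor.
    My reconstruction of the 'relative edit' metric; aggregation
    across a whole checkpoint is an assumption."""
    delta = math.sqrt(sum((v - b) ** 2 for b, v in zip(base, variant)))
    norm = math.sqrt(sum(b * b for b in base))
    return delta / norm

base = [1.0, -2.0, 0.5, 3.0]
edited = [b * 1.02 for b in base]   # uniform 2% scaling of every weight
rel = relative_edit(base, edited)   # ≈ 0.02
```

A uniform sub-percent value across every tensor type, as seen for HauhauCS, is exactly what a quantisation noise floor looks like under this metric.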
Pairwise cosine similarities between the four other techniques are mostly below 0.07. No two techniques discovered the same weight direction. The "refusal direction" in weight space is not a single vector but a manifold with many viable removal pathways.
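The pairwise comparison is plain cosine similarity between flattened edit directions. Again a sketch: how tensors are matched and flattened across checkpoints is an assumption here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two flattened weight-delta vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def edit_direction(base, variant):
    # The delta is what the abliteration actually wrote into the weights.
    return [v - b for b, v in zip(base, variant)]

base = [0.1, -0.4, 0.2, 0.3]
d1 = edit_direction(base, [0.2, -0.4, 0.2, 0.3])  # edits component 0
d2 = edit_direction(base, [0.1, -0.3, 0.2, 0.3])  # edits component 1
orthogonal = cosine(d1, d2)  # 0.0: disjoint edits, unrelated directions
```

Similarities below 0.07 mean the techniques' deltas are nearly orthogonal in this sense, which is what supports the "manifold, not a single vector" reading.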
What stands out
Heretic has the lowest KL divergence at 0.0037, rated "excellent," and the smallest weight footprint at a 2.1% relative edit. It is the only variant to score below base on GSM8K, at -6.9pp, and it reaches 100% Full CoT ASR. It changes 120 tensors across 3 tensor types.
Huihui has the smallest benchmark deltas. Average delta on non-GSM8K tasks is just 0.5pp, beating Heretic's 1.3pp, and it wins 4 of 6 non-GSM8K tasks head to head. Highest reported ASR at 98.5% with the fewest empty responses at just 5. KL divergence is 0.0074, also rated "excellent." But GSM8K at 75.1% is a +40.7pp jump over base. No abliteration should improve reasoning that much. I have double-checked these results and would be interested to see independent benchmarks from others.
HauhauCS has solid behavioural results despite the complex weight fingerprint. MMLU is +0.6pp over base, and 94.5% ASR rises to 100% Full CoT. The Reaper abliteration plus GGUF noise doesn't meaningfully damage output distributions. But the "lossless" claim is simply not borne out when Heretic and Huihui both preserve capabilities better, and the weights themselves carry Reaper's abliteration edits plus quantisation artefacts.
AEON degrades on every non-GSM8K task: TruthfulQA drops 10.6pp, ARC drops 3.0pp. It also has the worst thinking loops, with 45 of 400 HarmBench responses coming back empty. Its claims of "no looping, no philosophizing spirals" and "measurably enhanced capabilities" are contradicted by the data.
Abliterix has the worst capability preservation: Lambada perplexity increases 2.9x, from 3.18 to 9.12, and HellaSwag drops 6.2pp. Its edits are concentrated surgical strikes with extreme outlier magnitudes, and that concentration causes broad collateral damage.
What went wrong
85 hours of productive GPU time across 7 days, plus roughly 25 hours lost to 14 failed runs.
The bulk of the failures were GSM8K timeouts. The Qwen3.6 architecture is incompatible with BNB4 plus tensor parallelism, and the default 120s request timeout was too short for extended reasoning; I wrote a patched script with a 900s timeout to fix it. I also accidentally re-ran AEON's HarmBench with max_tokens=4096 instead of 6144, wasting 6.7 hours.
GSM8K per-model times vary dramatically because abliterated models think harder on math problems. HauhauCS took 53 minutes. AEON took 11 hours.
Methodology notes
All models evaluated with BitsAndBytes 4-bit quantisation on a single RTX 5090. Absolute scores are not directly comparable to bf16 results. Relative deltas between variants are preserved. GSM8K scores use flexible-extract matching. Treat GSM8K numbers as relative comparisons only.
The thinking budget matters. Initial runs with max_gen_toks=2048 gave terrible GSM8K scores because for reasoning models, max_gen_toks includes thinking tokens. The model would think for 1900 tokens, get cut off, and never produce an answer. Re-running with max_gen_toks=7168 gave the results above.
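Concretely, the failure mode looks like the sketch below. The `</think>` delimiter and the last-number "flexible-extract" pattern are my assumptions about the setup, not the harness's exact code.

```python
import re

def gsm8k_flexible_extract(generation):
    """Score only the visible answer after the thinking block.
    The '</think>' delimiter and the regex are assumptions about
    the harness setup, not its exact implementation."""
    if "</think>" not in generation:
        return None  # budget exhausted mid-CoT: no visible answer, scored wrong
    visible = generation.split("</think>", 1)[1]
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", visible)
    return numbers[-1].replace(",", "") if numbers else None

truncated = "<think>Let me work through this... 17 * 3 = 51, then"
finished  = "<think>17 * 3 = 51, plus 9 is 60.</think>The answer is 60."
a1 = gsm8k_flexible_extract(truncated)  # None
a2 = gsm8k_flexible_extract(finished)   # "60"
```

With a 2048-token budget, most generations looked like `truncated` above: correct reasoning in progress, nothing extractable, scored as wrong.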
Summary table
| Metric | Heretic | HauhauCS | Huihui | AEON | Abliterix |
|---|---|---|---|---|---|
| HarmBench ASR | 92.5% to 100% | 94.5% to 100% | 98.5% to 99.8% | 88.8% to 100% | 94.5% to 100% |
| MMLU | 82.8% | 83.9% | 83.4% | 82.9% | 81.3% |
| GSM8K | 27.5% | 51.0% | 75.1% | 51.2% | 37.6% |
| KL divergence | 0.0037 | 0.0242 | 0.0074 | 0.0238 | 0.0222 |
| Avg delta excl GSM8K | 1.3pp | 2.0pp | 0.5pp | 3.0pp | 5.0pp |
| Tensors changed | 120 | 564 | 128 | 88 | 101 |
Links
Full report with provenance analysis, tensor breakdown, and all charts: HuggingFace model card
Forensics toolkit: Abliterlitics on GitHub
GGUF-to-safetensors converter: ungguf on GitHub
Other tensor comparisons: DreamFast HauhauCS collection
While I have taken the time to verify all results thoroughly, I am open to any corrections, additional benchmarks, or further analysis. If you spot something that looks wrong and can be confirmed, I am happy to fix it.