Hello everyone. I just thought of something that seems obvious, but from what I've been able to find, nobody has done it, or at least nobody has openly disclosed it. Abliterated models have been getting much better, especially with new tooling like Heretic (shoutout u/-p-e-w- 😎), but they still suffer a noticeable drop in quality and coherence. I've never used an abliterated model that didn't show at least some signs of degradation; the closest exception I can think of is Orenguteng's Lexi Llama 8B back in the Llama 3 days, but I'm getting off topic.

What I'm proposing is this: use the abliterated model to generate responses to prompts the base model would refuse, generate the same set of responses with the base model (which will mostly be refusals), and then run a DPO pass on the base model with the abliterated outputs as "chosen" and the base model's refusals as "rejected". As far as I can tell, this should train the refusals out while avoiding the direct tensor edits abliteration performs, so it shouldn't damage the weights in ways that cause side effects or change model behavior beyond removing the refusals.

Has anybody tried this before? Is there something I'm missing? Any feedback is appreciated. I'm going to try this with Qwen 3.5 122B A10B later tonight and post the results, but if someone wants to save me the time and explain why it won't work, that would also be appreciated.
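In case it helps, here's a rough sketch of what I have in mind using TRL's DPOTrainer. The model names, prompts, and hyperparameters are all placeholders, and the exact DPOTrainer keyword names vary a bit across TRL versions, so treat this as an outline rather than a finished script:

```python
# Sketch of the proposed pipeline: abliterated completions as "chosen",
# base-model refusals as "rejected", DPO run on the base model.
# All model names and prompts below are placeholders.
import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

BASE = "your-org/base-model"            # placeholder: unmodified instruct model
ABLITERATED = "your-org/abliterated"    # placeholder: its abliterated variant

tokenizer = AutoTokenizer.from_pretrained(BASE)

def generate(model_name, prompts, max_new_tokens=512):
    """Generate one completion per prompt with the given model."""
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.bfloat16, device_map="auto"
    )
    outputs = []
    for prompt in prompts:
        input_ids = tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            add_generation_prompt=True,
            return_tensors="pt",
        ).to(model.device)
        out = model.generate(input_ids, max_new_tokens=max_new_tokens, do_sample=True)
        # Strip the prompt tokens, keep only the completion.
        outputs.append(
            tokenizer.decode(out[0, input_ids.shape[-1]:], skip_special_tokens=True)
        )
    return outputs

# Prompts the base model is known to refuse (you'd collect these yourself).
prompts = ["...refused prompt 1...", "...refused prompt 2..."]

chosen = generate(ABLITERATED, prompts)   # compliant answers
rejected = generate(BASE, prompts)        # the refusals to train out

dataset = Dataset.from_dict(
    {"prompt": prompts, "chosen": chosen, "rejected": rejected}
)

model = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto"
)
trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-derefusal", beta=0.1, num_train_epochs=1),
    train_dataset=dataset,
    processing_class=tokenizer,  # named `tokenizer=` in older TRL releases
)
trainer.train()
```

The one thing I'd probably add in practice is a filtering step on the "chosen" side, since the whole premise is that abliterated outputs are sometimes degraded, and you don't want to DPO toward incoherent completions.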