cs.AI, cs.CR, cs.LG

Self-Mined Hardness for Safety Fine-Tuning

arXiv:2605.03226v1 Announce Type: new
Abstract: Safety fine-tuning of language models typically requires a curated adversarial dataset. We take a different approach: score each candidate prompt’s difficulty by how often the target model’s own rollouts…
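The abstract cuts off mid-sentence, so the exact scoring criterion is not stated; below is a minimal Python sketch of one plausible reading, assuming "difficulty" means the fraction of the target model's own sampled rollouts that a safety judge flags as unsafe. The names `generate`, `is_unsafe`, `hardness_score`, and `mine_hard_prompts` are hypothetical illustrations, not the paper's API.

```python
# Hypothetical sketch of self-mined hardness scoring (assumption: a prompt is
# "hard" in proportion to how often the model's own rollouts come out unsafe).
from typing import Callable, List, Tuple


def hardness_score(
    prompt: str,
    generate: Callable[[str], str],    # assumed: samples one rollout from the target model
    is_unsafe: Callable[[str], bool],  # assumed: safety judge over a completion
    n_rollouts: int = 16,
) -> float:
    """Fraction of the model's own rollouts the judge flags as unsafe."""
    flags = sum(is_unsafe(generate(prompt)) for _ in range(n_rollouts))
    return flags / n_rollouts


def mine_hard_prompts(
    candidates: List[str],
    generate: Callable[[str], str],
    is_unsafe: Callable[[str], bool],
    top_k: int = 100,
) -> List[Tuple[str, float]]:
    """Score every candidate prompt and keep the top_k hardest for fine-tuning."""
    scored = [(p, hardness_score(p, generate, is_unsafe)) for p in candidates]
    scored.sort(key=lambda ps: ps[1], reverse=True)
    return scored[:top_k]
```

Under this reading, the selected hard prompts would then feed the safety fine-tuning stage in place of a hand-curated adversarial set; the scoring loop uses only the target model and a judge, with no external dataset.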