cs.AI, cs.CL

RepIt: Steering Language Models with Concept-Specific Refusal Vectors

arXiv:2509.13281v5 Announce Type: replace
Abstract: Current safety evaluations of language models rely on benchmark-based assessments that may miss localized vulnerabilities. We present RepIt, a simple and data-efficient framework for isolating concep…