SafeConstellations: Mitigating Over-Refusals in LLMs Through Task-Aware Representation Steering
arXiv:2508.11290v4 Announce Type: replace
Abstract: LLMs increasingly exhibit over-refusal behavior, where safety mechanisms cause models to reject benign instructions that superficially resemble harmful content. This phenomenon diminishes utility in prod…
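The abstract is truncated, so the paper's exact steering procedure is not shown here. As a generic illustration only (not SafeConstellations' method), representation steering typically means adding a direction vector to a model's hidden states at inference time; a common recipe derives that direction from the difference of mean activations between contrastive prompt sets (e.g. refusal-inducing vs. compliant). A minimal sketch with simulated activations, where all names and values are hypothetical:

```python
import numpy as np

# Toy illustration of activation steering (NOT the paper's method).
# Hidden states are simulated with random vectors; names are hypothetical.

rng = np.random.default_rng(0)
d = 16  # hidden size

# Simulated hidden activations for "refusal" vs. "compliance" prompts.
refusal_acts = rng.normal(loc=1.0, size=(8, d))
comply_acts = rng.normal(loc=-1.0, size=(8, d))

# Steering direction: difference of means (a common contrastive recipe),
# normalized to unit length.
steer = comply_acts.mean(axis=0) - refusal_acts.mean(axis=0)
steer /= np.linalg.norm(steer)

def apply_steering(hidden: np.ndarray, direction: np.ndarray,
                   alpha: float = 2.0) -> np.ndarray:
    """Shift a hidden state by alpha along the steering direction."""
    return hidden + alpha * direction

h = rng.normal(size=d)
h_steered = apply_steering(h, steer)

# Projection onto `steer` increases by exactly alpha (unit-norm direction).
print(float(h_steered @ steer - h @ steer))
```

In a real deployment this shift would be applied via a forward hook at a chosen transformer layer; "task-aware" steering, per the title, would presumably condition the direction or strength on the detected task, but those details are beyond the truncated abstract.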