cs.CL

A Theoretical Game of Attacks via Compositional Skills

arXiv:2605.01034v1 Announce Type: new
Abstract: As large language models grow increasingly capable, concerns about their safe deployment have intensified. While numerous alignment strategies aim to restrict harmful behavior, these defenses can still b…