cs.CR, cs.LG, stat.ML

Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory

arXiv:2604.12817v1 Announce Type: cross
Abstract: Adversarial training (AT) is an effective defense for large language models (LLMs) against jailbreak attacks, but performing AT on LLMs is costly. To improve the efficiency of AT for LLMs, recent studi…