Shaopeng Fu, Di Wang

Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory

Shaopeng Fu, Di Wang / April 15, 2026

arXiv:2604.12817v1 Announce Type: cross
Abstract: Adversarial training (AT) is an effective defense for large language models (LLMs) against jailbreak attacks, but performing AT on LLMs is costly. To improve the efficiency of AT for LLMs, recent studi…
