Gradient-Controlled Decoding: A Safety Guardrail for LLMs with Dual-Anchor Steering
arXiv:2604.05179v1
Abstract: Large language models (LLMs) remain susceptible to jailbreak and direct prompt-injection attacks, yet the strongest defensive filters frequently over-refuse benign queries and degrade user experience. Pr…