One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness
arXiv:2604.13006v1 Announce Type: cross
Abstract: Instruction-tuned large language models produce helpful, structured responses, but how robust is this helpfulness when trivially constrained? We show that simple lexical constraints (banning a single p…