Finding and Reactivating Post-Trained LLMs’ Hidden Safety Mechanisms
arXiv:2604.00012v1 Announce Type: new
Abstract: Although general-purpose large language models (LLMs) perform impressively, they often require fine-tuning or post-training to excel at specific tasks. For instance, large reasoning models (LRMs)…