When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains
arXiv:2604.19001v1
Abstract: Large reasoning models (LRMs) produce complex, multi-step reasoning traces, yet safety evaluation remains focused on final outputs, overlooking how harm emerges during reasoning. When jailbroken, harm do…