Authors: Pankayaraj Pathmanathan, Furong Huang

Deliberative Alignment is Deep, but Uncertainty Remains: Inference time safety improvement in reasoning via attribution of unsafe behavior to base model

Pankayaraj Pathmanathan, Furong Huang / April 17, 2026

arXiv:2604.09665v2 Announce Type: replace
Abstract: While the wide adoption of refusal training in large language models (LLMs) has showcased improvements in model safety, recent works have highlighted shortcomings due to the shallow nature of these a…

cs.AI, cs.LG

Deliberative Alignment is Deep, but Uncertainty Remains: Inference time safety improvement in reasoning via attribution of unsafe behavior to base model

Pankayaraj Pathmanathan, Furong Huang / April 14, 2026

arXiv:2604.09665v1 Announce Type: new
Abstract: While the wide adoption of refusal training in large language models (LLMs) has showcased improvements in model safety, recent works have highlighted shortcomings due to the shallow nature of these align…

cs.CL

Reward Models Can Improve Themselves: Reward-Guided Adversarial Failure Mode Discovery for Robust Reward Modeling

Pankayaraj Pathmanathan, Furong Huang / April 9, 2026

arXiv:2507.06419v2 Announce Type: replace
Abstract: Reward modeling (RM), which captures human preferences to align large language models (LLMs), is increasingly employed in tasks such as model finetuning, response filtering, and ranking. However, due…