Harshvardhan Saini, Yiming Tang, Dianbo Liu

Bridging Mechanistic Interpretability and Prompt Engineering with Gradient Ascent for Interpretable Persona Control

April 23, 2026

arXiv:2601.02896v2
Abstract: Controlling emergent behavioral personas (e.g., sycophancy, hallucination) in Large Language Models (LLMs) is critical for AI safety, yet remains a persistent challenge. Existing solutions face a dil…
