Authors: Seonglae Cho, Zekun Wu, Adriano Koshiyama

CorrSteer: Generation-Time LLM Steering via Correlated Sparse Autoencoder Features

Seonglae Cho, Zekun Wu, Adriano Koshiyama / May 5, 2026

arXiv:2508.12535v3 Announce Type: replace
Abstract: Sparse Autoencoders (SAEs) can extract interpretable features from large language models (LLMs) without supervision. However, their effectiveness in downstream steering tasks is limited by the requir…

cs.AI, cs.CL, cs.LG

Control Reinforcement Learning: Interpretable Token-Level Steering of LLMs via Sparse Autoencoder Features

Seonglae Cho, Zekun Wu, Adriano Koshiyama / May 5, 2026

arXiv:2602.10437v3 Announce Type: replace-cross
Abstract: Sparse autoencoders (SAEs) decompose language model activations into interpretable features, but existing methods reveal only which features activate, not which change model outputs when amplif…