Ziqian Zhong, Aditi Raghunathan

Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs

Ziqian Zhong, Aditi Raghunathan / April 22, 2026

arXiv:2508.00161v3
Abstract: The releases of powerful open-weight large language models (LLMs) are often not accompanied by access to their full training data. Existing interpretability methods, particularly those based on activ…
