cs.CL, cs.LG

Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs

arXiv:2508.00161v3 Announce Type: replace
Abstract: The releases of powerful open-weight large language models (LLMs) are often not accompanied by access to their full training data. Existing interpretability methods, particularly those based on activ…