Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training
arXiv:2605.06076v1 Announce Type: new
Abstract: The “Locate-then-Update” paradigm has become a predominant approach in the post-training of large language models (LLMs), identifying critical components via mechanistic interpretability for targeted par…