Disillusionment with mechanistic interpretability research [D]

Hey all, apologies if this is the wrong place to post this. I'm an undergrad computer scientist who got swept up in the mechanistic interpretability wave around 2024 (sparse autoencoders, attribution graphs) and found the field genuinely promising; I still do. That said, a lot of the recent research out of Anthropic (which I understand to be the main mech interp house) doesn't sit well with me.

They recently published a blogpost on so-called "natural language autoencoders": training one LLM to compress activations into a natural-language description and a second LLM to reconstruct the activations from that description. This seems extremely suspect to me. For starters, it's a black-box technique, which makes the claim that it helps us understand model internals very weak. They also don't report basic metrics (fraction of variance explained, reconstruction error) against SAE baselines. Moreover, the blogpost mentions so-called "confabulations", cases where the "activation verbalizer" module simply makes things up when explaining the activations, which to me defeats the entire purpose: at test time you may never know whether a given explanation is confabulated.
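
To be concrete about what I mean by "basic metrics", here's a rough sketch of the apples-to-apples comparison I'd want to see between an SAE and a natural-language bottleneck. I'm using my own definition of FVE (1 minus MSE over the variance of the original activations), which may not match how any given paper computes it, and the "reconstructions" below are just toy stand-ins, not real model outputs.

    # Sketch only: FVE and reconstruction error for any activation
    # encode -> decode pipeline, whether the bottleneck is an SAE or
    # a natural-language description. Definitions are my own.
    import numpy as np

    def reconstruction_metrics(acts: np.ndarray, recon: np.ndarray) -> dict:
        """acts, recon: (n_samples, d_model) activation matrices."""
        mse = np.mean((acts - recon) ** 2)
        # FVE here = 1 - MSE / variance of the original activations
        var = np.mean((acts - acts.mean(axis=0)) ** 2)
        return {"mse": float(mse), "fve": float(1.0 - mse / var)}

    # Toy example: random "activations" and two fake reconstructions.
    rng = np.random.default_rng(0)
    acts = rng.normal(size=(1024, 768))
    sae_recon = acts + rng.normal(scale=0.3, size=acts.shape)  # stand-in for an SAE
    nl_recon = acts + rng.normal(scale=0.8, size=acts.shape)   # stand-in for the NL autoencoder

    print("SAE baseline:  ", reconstruction_metrics(acts, sae_recon))
    print("NL autoencoder:", reconstruction_metrics(acts, nl_recon))

Even two numbers like these, computed on held-out activations, would make the black-box claim much easier to evaluate than the qualitative examples alone.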

Granted, the blogpost acknowledges most of these issues, and they do seem to achieve good results on a benchmark for auditing misaligned models (though the utility of that also seems dubious to me; I've never been one for AI x-risk arguments). But overall it seems that Anthropic, especially recently, cares less about interpretability than about scalable alignment/oversight, and is happy to sacrifice the former if it means better progress on the so-called control problem. Given how closely the field seems to track Anthropic's movements, I'm concerned that this is where mech interp as a whole is heading.

Let me know if this is the wrong place to post this.

submitted by /u/Carbon1674