Anthropic has released new research on showing what an LLM is "thinking" when it generates the next token, using NLAs ("Natural Language Autoencoders"). An NLA is a companion to an LLM that translates the LLM's internal activations at any specific token into readable text.
They have also released NLA model weights for Gemma 3 27B Instruct:
- Auto Verbalizer (AV): https://huggingface.co/kitft/nla-gemma3-27b-L41-av
- Activation Reconstructor (AR): https://huggingface.co/kitft/nla-gemma3-27b-L41-ar
Neuronpedia is currently hosting them at https://www.neuronpedia.org/gemma-3-27b-it/nla
To try it, go to the Neuronpedia link above, ask Gemma 3 a question, then click on any token and click explain. The site will show you what the model was "thinking" when it generated that token.
The Auto Verbalizer (itself an LLM) translates the LLM's activations into readable text. The Activation Reconstructor only serves as a check: it verifies whether the text generated by the AV can be translated back into the LLM's activations.
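To make the AV/AR round trip concrete, here is a toy sketch in plain Python. It does not load or call the real NLA weights. `toy_verbalize`, `toy_reconstruct`, the `CONCEPTS` list, and the use of cosine similarity as the score are all assumptions made up for illustration; the actual models and the metric used in the research may differ.

```python
import math

# Hypothetical 3-dimensional "activation space"; each axis gets a made-up label.
CONCEPTS = ["rhyme", "number", "city"]


def toy_verbalize(activation):
    """Stand-in for the AV: describe the activation by its strongest axis."""
    strongest = max(range(len(activation)), key=lambda i: activation[i])
    return CONCEPTS[strongest]


def toy_reconstruct(text):
    """Stand-in for the AR: map the description back to a vector (one-hot)."""
    return [1.0 if concept == text else 0.0 for concept in CONCEPTS]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def round_trip(activation, verbalize, reconstruct):
    """Activation -> text -> activation, plus a score for how much survived."""
    text = verbalize(activation)
    rebuilt = reconstruct(text)
    return text, cosine_similarity(activation, rebuilt)


activation = [0.1, 0.9, 0.2]
text, score = round_trip(activation, toy_verbalize, toy_reconstruct)
print(f"explanation: {text!r}, reconstruction score: {score:.3f}")
```

The idea is that a high score means the text kept most of the information in the activation, while a low score means the explanation dropped or distorted something.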