Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations

Abstract

We introduce Natural Language Autoencoders (NLAs), an unsupervised method for generating natural language explanations of LLM activations. An NLA consists of two LLM modules: an activation verbalizer (AV) that maps an activation to a text descr…