cs.LG

Exemplar Partitioning for Mechanistic Interpretability

arXiv:2605.14347v1 Announce Type: new
Abstract: We introduce Exemplar Partitioning (EP), an unsupervised method for constructing interpretable feature dictionaries from large language model activations with $\sim 10^{3}\times$ fewer tokens than compar…