I built a tool that shows you what GPT-2 is "thinking" in real-time as it generates 3D graph of concept activations per token [R]

Been going down a mechanistic interpretability rabbit hole for the past few weeks and ended up building this thing called AXON.

The idea: every time GPT-2 generates a token, its residual stream gets passed through a Sparse Autoencoder (Joseph Bloom's pretrained SAE). The SAE decomposes it into human-interpretable feature: hings like "European geography", "capital cities", "French language" and streams those to the browser over WebSocket, where they show up as a live 3D force graph.

Nodes = SAE features. Edges = features that fired together on the same token. Node brightness = activation strength. The whole graph evolves token by token.

What surprised me most: type "The capital of France is" and you can literally watch geography features, proper noun features, and completion-pattern features light up before the word "Paris" even gets generated. It's not what the model outputs that's interesting it's what's happening right before it decides.

Stack: TransformerLens + SAELens on the backend, FastAPI WebSocket for streaming, Three.js + 3d-force-graph on the frontend. Runs on CPU (~800ms/token) or GPU (~35ms on a 4050). Labels come from Neuronpedia's API and get cached locally.

You can also swap in other models — GPT-2 medium/large/xl, Pythia variants, Gemma-2-2B — as long as there's a pretrained SAE for it in SAELens.

GitHub: https://github.com/09Catho/axon

Would love feedback and stars especially from anyone who's worked with SAEs before curious whether the co-activation edges are actually meaningful or just noise at this layer.

submitted by /u/Financial_World_9730
[link] [comments]

Leave a Comment