Many interpretability researchers (ourselves included) believe that neural networks store knowledge in superposition—that is, networks encode more facts than they have individual components. A natural extension of this idea is that networks also perform computation on knowledge that lives in superposition. Despite the centrality of this concept, there are few concrete examples of what computation in superposition actually looks like in practice.
In this post, we study a toy memorization task where a network must recognize valid first-name/last-name pairs. We first construct a handcrafted network that solves the task by performing computation in superposition. Then, we describe experiments testing whether a trained network implements the same mechanism. The trained model uses, in part, the predicted mechanism, but it also includes neurons that employ a different strategy that does not rely on superposition at all. Based on this finding, we construct a second handcrafted network that captures this learned mechanism in its purest form, using just two neurons to memorize an arbitrary number of name pairs.
The contrast between these two mechanisms is the main point of this post. Superposition is one strategy a network might use, but it's not the only one. Even in our restricted setting, trained networks can mix superposition-based computation with clever encodings that sidestep superposition entirely. Understanding both kinds of algorithms gives us a sharper vocabulary for asking not only what a network knows, but how it uses what it knows, which may ultimately help us identify these mechanisms in larger, more capable models where the safety implications are more pressing.
Problem Statement
We wish to memorize the names of eight famous athletes. First names and last names are each one-hot encoded. The input consists of one valid first name concatenated with a valid last name. The network outputs 1 when the pair corresponds to a famous person and 0 otherwise. The list of names is below:

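As a concrete sketch of the input format, here is how the one-hot encoding and concatenation might look, shown for a handful of the athletes:

```python
import numpy as np

# A handful of the athletes, for illustration. First names and last names
# each get their own one-hot vocabulary.
first_names = ["Babe", "Serena", "Peyton", "Lionel", "Michael"]
last_names = ["Ruth", "Williams", "Manning", "Messi", "Jordan"]

def encode(first, last):
    """Concatenate the one-hot vector for the first name with the
    one-hot vector for the last name."""
    x = np.zeros(len(first_names) + len(last_names))
    x[first_names.index(first)] = 1.0
    x[len(first_names) + last_names.index(last)] = 1.0
    return x

print(encode("Babe", "Ruth"))    # famous pair     -> target 1
print(encode("Peyton", "Ruth"))  # mismatched pair -> target 0
```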
Handcrafted Network #1: The Additive Method
Each neuron recognizes partial evidence for several different athletes. We begin by describing the role of individual neurons, then explain how they work together, and finally offer some observations about the network's properties.
Individual Neurons
In our handcrafted network, each neuron is partially responsible for recognizing half of the famous athletes (four out of the eight). Each neuron is designed to fire when the one-hot encodings for both the first name and last name of its assigned athletes are present. Here is one such neuron, which fires on the first four athletes (note the bias of −1, so that the neuron only fires when recognizing a first AND a last name):
ReLU(Babe + Serena + Peyton + Lionel + Ruth + Williams + Manning + Messi − 1)
When we pass in the name "Babe Ruth," the neuron activates:
ReLU(1 + 0 + 0 + 0 + 1 + 0 + 0 + 0 − 1) = 1
The neuron receives +1 from "Babe," +1 from "Ruth," and −1 from the bias, yielding a total activation of 1.
The neuron also fires for "Peyton Manning":
ReLU(0 + 0 + 1 + 0 + 0 + 0 + 1 + 0 − 1) = 1
Note, however, that the neuron also fires for certain non-famous combinations, such as "Peyton Ruth":
ReLU(0 + 0 + 1 + 0 + 1 + 0 + 0 + 0 − 1) = 1
We will see shortly how combining multiple neurons resolves these false positives.
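As a minimal sketch, this single neuron can be written as a weight of +1 on each of its assigned names plus a bias of -1:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

# +1 weight on every first and last name this neuron is responsible for.
assigned = {"Babe", "Serena", "Peyton", "Lionel",
            "Ruth", "Williams", "Manning", "Messi"}

def neuron(first, last, bias=-1.0):
    pre_activation = (first in assigned) + (last in assigned) + bias
    return relu(pre_activation)

print(neuron("Babe", "Ruth"))       # 1.0 -- famous pair, fires
print(neuron("Peyton", "Manning"))  # 1.0 -- famous pair, fires
print(neuron("Peyton", "Ruth"))     # 1.0 -- false positive, also fires
print(neuron("Michael", "Jordan"))  # 0.0 -- not assigned to this neuron
```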
Combining All Neurons
We construct a layer made of six such neurons, designed so that each neuron is partially responsible for four different athletes and each athlete activates exactly three distinct neurons. The athletes assigned to each neuron are shown below:

Given this arrangement, we set the output layer to sum the activations of all six neurons and subtract a bias of 2.5; the network reports "famous" whenever the resulting score is positive.
Here are some worked examples:
Peyton Manning (famous):
- Neurons 1, 3, 5: ReLU(1 + 1 − 1) = 1 → fire
- Neurons 2, 4, 6: ReLU(0 + 0 − 1) = 0 → do not fire
- Famous score: 1 + 0 + 1 + 0 + 1 + 0 − 2.5 = 0.5
- The network outputs famous. ✓
Peyton Ruth (not famous):
- Neurons 1, 3: ReLU(1 + 1 − 1) = 1 → fire
- Neurons 2, 5: ReLU(1 + 0 − 1) = 0 → do not fire
- Neurons 4, 6: ReLU(0 + 0 − 1) = 0 → do not fire
- Famous score: 1 + 0 + 1 + 0 + 0 + 0 − 2.5 = −0.5
- The network outputs not famous. ✓
Babe Jordan (not famous):
- Neurons 1–6: ReLU(1 + 0 − 1) = 0 → none fire
- Famous score: 0 + 0 + 0 + 0 + 0 + 0 − 2.5 = −2.5
- The network outputs not famous. ✓
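To make the full construction concrete, here is a sketch of the six-neuron network with the readout described above. The neuron assignment (each athlete gets a 3-bit code, the first three neurons fire on the code and the last three on its complement) is one possible choice consistent with the worked examples; the readout sums the six activations and subtracts 2.5:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

# One possible assignment: give each athlete a 3-bit code. The first three
# neurons carry the code and the last three carry its complement, so every
# athlete activates exactly three of the six neurons.
codes = {
    ("Babe", "Ruth"): (1, 1, 1),
    ("Serena", "Williams"): (1, 1, 0),
    ("Peyton", "Manning"): (1, 0, 1),
    ("Lionel", "Messi"): (1, 0, 0),
    ("Michael", "Jordan"): (0, 0, 0),
    # ...the remaining athletes take the remaining codes.
}

def pattern(code):
    """The six neurons a name with this code feeds into (code + complement)."""
    return np.array(list(code) + [1 - b for b in code], dtype=float)

# Both names of a famous athlete feed the same three neurons.
first_w = {first: pattern(c) for (first, last), c in codes.items()}
last_w = {last: pattern(c) for (first, last), c in codes.items()}

def famous_score(first, last):
    hidden = relu(first_w[first] + last_w[last] - 1.0)  # six neurons, bias -1
    return hidden.sum() - 2.5                            # readout bias of -2.5

for pair in [("Peyton", "Manning"), ("Peyton", "Ruth"), ("Babe", "Jordan")]:
    score = famous_score(*pair)
    print(pair, score, "famous" if score > 0 else "not famous")
# -> 0.5 famous, -0.5 not famous, -2.5 not famous
```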
Properties of This Network
Examining the first three and last three neurons separately reveals two 3-bit binary encodings. We can extend this construction to larger networks: with 2k such neurons, the same scheme can in principle memorize 2^k athletes, so the number of stored facts grows exponentially while the number of neurons grows only linearly.
It is also worth noting that some non-famous names narrowly miss the threshold (scoring −0.5 for "famous"), while others trigger no neurons at all and miss by a wider margin (−2.5). This variation in confidence for incorrect inputs gives a toy example of how models can be wrong with different degrees of confidence in unexpected ways, and may have implications for understanding hallucination-like behavior in larger models.
Does This Actually Happen in Trained Networks?
The natural next question is whether networks actually implement this kind of mechanism. To test this, we trained a six-neuron, single-layer network on the proposed task. This gave the model enough capacity to implement our solution, with what we thought was minimal flexibility to do much else. As with many experiments, the model did something else anyway, splitting its neurons across two strategies. A couple of neurons did vote additively for famous pairs, using the mechanism from our first handcrafted network. Others organized into pairs that voted against non-famous combinations, implementing an approach we hadn't anticipated. The remainder of this section focuses on that second strategy.
For these neuron pairs, the trained network implements the following algorithm:
Each first name is assigned an arbitrary score, and its paired last name is assigned the opposite score. This process is repeated for all pairs, ensuring all scores are unique. The network uses these scores in two paired neurons: the score of an input is the sum of its first-name and last-name scores; one neuron computes ReLU(score) and the other computes ReLU(-1 * score).
Worked Example
Suppose we assign the following scores:
| Name | Score |
|------|-------|
| Babe | +1 |
| Ruth | -1 |
| Peyton | +2 |
| Manning | -2 |
| Michael | +3 |
| Jordan | -3 |
Here are some worked examples:
| Input | Score | ReLU(score) | ReLU(-1 * score) | Classification |
|-------|-------|-------------|------------------|----------------|
| Babe Ruth | 1 - 1 = 0 | 0 | 0 | famous ✓ |
| Peyton Manning | 2 - 2 = 0 | 0 | 0 | famous ✓ |
| Peyton Ruth | 2 - 1 = 1 | 1 | 0 | not famous ✓ |
The key insight is that famous pairs have scores that cancel perfectly, producing zero activation in both neurons. A non-famous pair, however, always produces a nonzero activation in exactly one of the two, and that active neuron generates a "not famous" signal.
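As a minimal sketch, the pair of neurons above can be written directly from the score table:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

scores = {"Babe": 1, "Ruth": -1, "Peyton": 2, "Manning": -2,
          "Michael": 3, "Jordan": -3}

for first, last in [("Babe", "Ruth"), ("Peyton", "Manning"), ("Peyton", "Ruth")]:
    s = scores[first] + scores[last]   # famous pairs cancel to exactly zero
    n1, n2 = relu(s), relu(-s)         # the two paired neurons
    famous = (n1 == 0) and (n2 == 0)   # any activation votes "not famous"
    print(first, last, s, n1, n2, "famous" if famous else "not famous")
```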
Handcrafted Network #2: The Subtractive Method
We realized that this mechanism, on its own, could solve the entire problem. Using it, we constructed a second handcrafted network that implements the subtractive approach using only two neurons. This demonstrates a remarkably efficient encoding: just two neurons can memorize an arbitrary number of name pairs, provided that no first or last name appears in more than one pair. The trick is surprisingly simple and intuitive: the scores of a famous first and last name exactly cancel each other out, while non-famous pairs have some nonzero mismatch that can be detected. Geometrically, we can think of the model as embedding each pair in a 2D plane corresponding to the two neurons' pre-activations.

The model separates “famous” and “not famous” by placing all inputs along a line, with the nonlinear activation providing a kink right where all of the famous pairs lie.
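To illustrate how far this scales, here is a sketch of the two-neuron network with one possible linear readout (output "famous" when 0.5 - (n1 + n2) is positive), applied to a synthetic set of 100 memorized pairs:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

# Synthetic construction: first name i gets score +(i+1); its paired last
# name gets the opposite score, so only matching pairs cancel exactly.
N = 100
first_scores = np.arange(1, N + 1, dtype=float)
last_scores = -first_scores

def is_famous(i, j):
    """Classify the pair (first name i, last name j) using two neurons."""
    s = first_scores[i] + last_scores[j]
    n1, n2 = relu(s), relu(-s)
    return 0.5 - (n1 + n2) > 0   # any mismatch pushes the score below zero

correct = sum(is_famous(i, j) == (i == j) for i in range(N) for j in range(N))
print(correct, "of", N * N, "pairs classified correctly")  # 10000 of 10000
```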
Additional Observations
When we train a network with only two LeakyReLU neurons and initialize them with negative

We tested stability by perturbing all learned parameters with normally distributed noise whose standard deviation was proportional to the magnitude of each parameter. When the noise scale is ~10% of the parameter magnitude, the noise is strong enough to cause a drop in classification accuracy. However, the model returns to the perfect solution on further training. Noise above this threshold becomes increasingly difficult to recover from. This all suggests that the two-neuron solution is stable and reachable, but not always easy for training to find.
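A minimal sketch of this perturbation, assuming a PyTorch model and reading "proportional to the magnitude of each parameter" elementwise:

```python
import torch

def perturb_(model, rel_scale=0.10):
    """Add Gaussian noise to every parameter in place, with standard
    deviation proportional to that parameter's elementwise magnitude."""
    with torch.no_grad():
        for p in model.parameters():
            p.add_(torch.randn_like(p) * rel_scale * p.abs())

# Usage: perturb_(trained_model, rel_scale=0.10), then resume training and
# check whether classification accuracy recovers.
```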
Conclusion
This post presented two methods by which networks might store and retrieve facts. Our first handcrafted example relied on a superposition of facts, using an additive mechanism: neurons fire when both components of a famous name are present, and a voting threshold determines the output. The second mechanism was subtractive: famous pairs are encoded such that their scores cancel, while non-famous pairs produce residual activations that vote against recognition. The subtractive solution wasn't intuitive to us at first. We initially assumed superposition was doing all the work, but once we saw the subtractive mechanism, the geometry felt obvious in hindsight: the network simply places all inputs on a line and lets the ReLU kink do the classification.
Our experiments confirmed that the subtractive mechanism arises in trained networks. In the two-neuron setting, it appeared on its own. In the six-neuron setting, the trained model landed on a hybrid, using some neurons to vote additively for famous pairs and others to vote subtractively against non-famous ones. Even on a task this simple, the network wove superposition-based computation together with a decision-boundary trick rather than committing to one strategy.
We hypothesize that the pure additive mechanism also occurs in practice and leave its identification as a direction for future work. In particular, the subtractive method only produces a scalar "famous / not famous" verdict - it does not leave behind a representation of which famous person was identified. In more complex networks that need to perform further computation on the recognized entity (e.g., retrieving associated facts, routing to downstream circuits), the additive mechanism's per-person neuron activations may be necessary, since they preserve identity information that the subtractive collapse discards.
The broader lesson is that memorized facts can be used in more than one computational form. Sometimes a network may compute over facts stored in superposition. Other times, it may find a lower-dimensional encoding that solves the immediate task while discarding information that could matter for downstream computation. Mapping when and why networks reach for each strategy and what hybrids they construct in between is a productive next step toward understanding computation in superposition in the wild.
Contributions
Rick developed the first handcrafted model. Kyle and Rick jointly ran experiments and worked together to understand the second mechanism. Rick wrote the majority of the writeup; Kyle wrote the Additional Observations section. The writeup was jointly edited by Kyle and Rick.