This is the third post in a sequence on substrates - the layers of computational context that allow AI to be implemented in real systems. The sequence expands on the concept of substrates as described in this paper and was written as part of the AI Safety Camp project "MoSSAIC: Scoping out Substrate Flexible Risks," one of the three projects associated with Groundless.
We claim that AI safety and security research currently has no clean way to reason about substrates. Post 1 introduced the intuitions for what substrates are and why they matter. Post 2 showed how substrate choices (LayerNorm placement, quantization format, DRAM topology) influence safety-relevant model properties in ways that are not captured by any standard toolkit. This final post introduces the formal framework.
In the previous posts, we looked at choices below the model architecture level (like normalization, weight encoding, and memory layout) and saw that they affect things we care about, like refusal behavior, robustness, and jailbreaks. But we didn’t have a clear way to say where these effects were coming from or how to describe them.
This makes it harder to think clearly and design good evaluations, because the current terms mix up different ideas. In this post, we try to separate these ideas so we can reason about and compare models more clearly. If you can’t name a gap, you can’t design around it. And you can’t compare two deployments of the “same” model without saying what “same” means in different settings. This post is a step toward that.
Four Things a Substrate Is Made Of
We’ll begin with a concrete example and introduce notation only after that. The 4-tuple we arrive at is not arbitrary: each part answers a different question, and the banking example will show why.
Alice wants to send Bob €500. She can do that in several ways. She can use her bank’s website, call the bank and speak to an operator, write a cheque, or use a payment app. The intended action is the same, but each option uses different syntax, different processing systems, and a different interface to the outside world.
If everything works, the abstract result is the same: Alice’s balance goes down by €500 and Bob’s goes up by €500. That is the part we care about for things like fraud detection or account reconciliation. In most cases, whether the transfer happened through a website or over the phone does not matter.
Now consider what happens when something goes wrong. A fraud detector that only watches web transactions will miss the same fraud done over the phone. A system that only tracks transaction IDs will miss cheque-based fraud entirely. What the evaluator sees depends on the interface it uses, not on the underlying behavior.
That is the basic idea the substrate formalism is meant to capture.
The Formal Definition
Now we can abstract. The banking example had four moving parts:
- the set of ways Alice can describe her intended transfer (web form fields, spoken words, ink on a cheque), call this the language L
- the process that turns a described transfer into an actual change in account balances, call this the semantics map ⟦·⟧
- the real-world costs of each channel (time to process, fees, error rates, staffing), call this the resource profile R
- the part of the system a monitor or auditor can actually see (the web transaction log, the phone call reference number, the cheque image), call this the observable interface O

These are exactly the four substrate components. Packaging them together gives the definition.
A substrate is a 4-tuple

S = (L, ⟦·⟧, R, O)

where:

- L is the language, the set of syntactic expressions, encodings, or programs the substrate can accept;
- ⟦·⟧ : L → B is the semantics map, a function assigning to each syntactic object p ∈ L an abstract behavior ⟦p⟧ ∈ B, where B is a fixed space of abstract behaviors;
- R is the resource profile, the computational budget available to the substrate (time, memory, energy, numeric precision, etc.);
- O is the observable interface, which determines which aspects of behavior in B are externally visible, via an observation map ω : B → Ω.
The full computation pipeline that a substrate participates in can now be written out explicitly. Given an input x:

- The input is encoded: an encoding function enc produces p = enc(x) ∈ L.
- The substrate interprets p: the semantics map returns the abstract behavior b = ⟦p⟧ ∈ B.
- The interface observes the behavior: o = ω(b) ∈ Ω.
- The observation is decoded into an output: y = dec(o).

The end-to-end computation is the composite y = dec(ω(⟦enc(x)⟧)).
The banking example maps onto this pipeline directly: Alice’s intent is the input x, the web form or cheque is the encoding, the bank’s back-office processing is the semantics map, and the transaction log is what the auditor’s interface observes.
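As a minimal sketch of this pipeline in Python (the names Substrate, run, enc, dec, and omega are illustrative choices, not notation from the paper):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Substrate:
    semantics: Callable  # ⟦·⟧ : L -> B
    omega: Callable      # observation map ω : B -> Ω

def run(sub: Substrate, enc: Callable, dec: Callable, x):
    p = enc(x)            # encode the input into the language L
    b = sub.semantics(p)  # interpret: abstract behavior in B
    o = sub.omega(b)      # observe through the interface O
    return dec(o)         # decode the observation into the output y

# Toy instance: the behavior includes a full trace, but the interface
# exposes only the final result, so the trace is invisible downstream.
toy = Substrate(
    semantics=lambda p: {"trace": p, "result": sum(p)},
    omega=lambda b: b["result"],
)
print(run(toy, enc=lambda x: list(x), dec=str, x=(1, 2, 3)))  # "6"
```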

The commutative diagram for substrate computation
What the Definition Clarifies
The first thing this definition separates is syntactic identity from behavioral identity. Two programs can be written in completely different languages, use very different amounts of memory, and still compute the same input-output function. The reverse can also happen: two programs can look almost identical syntactically but behave differently once the resource profile matters. A simple example is the same neural network forward pass run in float32 and in bfloat16. The code may look nearly the same and apply the same operations to the same tensors, but near the edge of the input distribution the outputs can differ because bfloat16 loses precision that float32 keeps. The difference is not in the high-level description alone, but in the substrate. Without separating the syntax from the resource profile, these two kinds of difference get mixed together.
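A minimal sketch of this, assuming PyTorch is available (the layer and inputs are illustrative, not a real deployment):

```python
import torch

torch.manual_seed(0)

# The "same" forward pass under two numeric resource profiles.
# bfloat16 keeps only 8 mantissa bits, so the two runs diverge in the
# low-order bits; when two logits are nearly tied, the argmax can flip.
W = torch.randn(16, 16)
x = torch.randn(16)

y32 = W @ x                                          # float32 substrate
y16 = W.to(torch.bfloat16) @ x.to(torch.bfloat16)    # bfloat16 substrate

print("max abs difference:", (y32 - y16.float()).abs().max().item())
print("same argmax:", y32.argmax().item() == y16.float().argmax().item())
```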
The second thing the definition clarifies is that the same abstract behavior can be realized by multiple substrates: the Clock and Pizza models in the next section are one instance, and the Python and C++ sorting programs in the appendix are another.
The third clarification, and the most important for safety, is about observability. Consider two scenarios:

Scenario A: A model’s abstract behavior b ∈ B contains a failure mode, but the inputs that trigger it never pass through the deployed interface O, so every observation ω(b) looks safe.

Scenario B: A researcher changes the interface to include the inputs that trigger the failure. The same abstract behavior b now produces observations that expose the failure.

In both cases, the underlying behavior in B is identical; the only thing that changed is O. What an evaluation reports is a property of the pair (behavior, interface), not of the behavior alone.
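A toy sketch of the two scenarios (the trigger input and behavior function are invented for illustration):

```python
# The same abstract behavior evaluated through two interfaces.
def behavior(x: str) -> str:
    return "UNSAFE" if x == "trigger" else "safe"

benign_inputs = ["hello", "weather", "recipe"]

# Interface O_A: observations restricted to benign inputs.
obs_A = {x: behavior(x) for x in benign_inputs}
# Interface O_B: the evaluator widens the interface to include the trigger.
obs_B = {x: behavior(x) for x in benign_inputs + ["trigger"]}

print("UNSAFE" in obs_A.values())  # False: the failure is invisible under O_A
print("UNSAFE" in obs_B.values())  # True: same behavior, different interface
```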
Modular Addition as the Main Worked Example
The main example comes from mechanistic interpretability. It is useful because the task itself is completely well understood: modular arithmetic. The abstract behavior is the same across the models, but the way the models implement it is very different.
The Setup
Zhong, Liu, Tegmark, and Andreas (NeurIPS 2023) trained small transformer models to do modular addition: given two inputs a and b, predict (a + b) mod 59.
The Clock Algorithm
One group of models, called Model B, uses a standard one-layer transformer with attention. These models implement what the paper calls the Clock algorithm. The idea is simple: each integer from 0 to 58 is represented as a point on a circle, with a number a placed at angle 2πa/59, so that adding numbers becomes adding angles, like the hands of a clock.
Concretely, each token is embedded in a way that encodes its angle on the circle. The attention mechanism produces products that combine information from the two input positions, and those products encode the sum through the angle-addition formula cos(θa)cos(θb) − sin(θa)sin(θb) = cos(θa + θb). The logit for a candidate output is then maximized at the correct residue.
The Clock algorithm needs multiplication between the inputs. It uses a special feature of attention: attention weights multiply the value vectors they route, so the model can combine terms across the two input positions. That kind of cross-token interaction is exactly what the trigonometric identity needs. So the Clock algorithm is the natural solution when attention is the main source of nonlinearity.
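As an idealized sketch of the Clock readout (a clean mathematical caricature of the algorithm, not the trained model’s actual weights):

```python
import numpy as np

P = 59  # the modulus; tokens are integers 0..58

def clock_logits(a: int, b: int) -> np.ndarray:
    """Idealized Clock readout: the logit for candidate c is
    cos(theta * (a + b - c)), which the angle-addition identity
    makes maximal exactly at c = (a + b) mod P."""
    theta = 2 * np.pi / P
    return np.cos(theta * (a + b - np.arange(P)))

a, b = 41, 30
print(np.argmax(clock_logits(a, b)), (a + b) % P)  # 12 12
```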
The Pizza Algorithm
Another group of models, Model A, uses constant or uniform attention. These models implement something different, which the paper calls the Pizza algorithm. Instead of working on the circle itself, this algorithm works inside it.
The key geometric fact is this: for a fixed target residue, all input pairs that produce that residue have midpoints that lie on one specific ray from the origin in the 2D embedding plane. These rays divide the disk into 59 “pizza slices,” which is where the name comes from. To decide the output, the network checks which slice the midpoint falls into.
The logit formula is different from the Clock case by a multiplicative factor: the length of the midpoint vector, which shrinks as the two inputs move apart on the circle. That factor makes the Pizza algorithm depend on the distance between the two inputs on the circle, while the Clock algorithm does not.
The Pizza algorithm only needs absolute-value nonlinearity, not multiplication. Once the midpoint is computed as a linear operation, the problem becomes checking which side of certain lines it lies on, and absolute value together with linear layers can do that cleanly. So ReLU layers can implement the whole pipeline without the cross-token multiplication that the Clock algorithm needs.
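An idealized sketch in the same style (again a caricature, not extracted weights; the half-angle rays are one way to implement the one-ray-per-residue picture above):

```python
import numpy as np

P = 59

def pizza_logits(a: int, b: int) -> np.ndarray:
    """Idealized Pizza readout: embed a and b on the circle, take the
    midpoint, and score candidate c by the midpoint's absolute alignment
    with c's ray. Only linear maps and an absolute value are needed."""
    w = 2 * np.pi / P
    mid = 0.5 * np.array([np.cos(w * a) + np.cos(w * b),
                          np.sin(w * a) + np.sin(w * b)])
    c = np.arange(P)
    rays = np.stack([np.cos(w * c / 2), np.sin(w * c / 2)])  # one ray per slice
    return np.abs(mid @ rays)

a, b = 41, 30
print(np.argmax(pizza_logits(a, b)), (a + b) % P)  # 12 12
```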

Illustration of the Clock and the Pizza Algorithm (from Zhong et al., 2023)
What This Means in Substrate Terms
Now let us map this onto the 4-tuple. Define two substrates:

S_Clock = (L, ⟦·⟧_Clock, R_Clock, O):
- L: the set of token-pair inputs (a, b)
- ⟦·⟧_Clock: the Clock forward pass, circular embeddings followed by attention-mediated angle addition (i.e., the forward pass of a transformer with standard attention)
- R_Clock: the architectural affordances of a full-attention transformer; cross-token multiplicative interactions are available
- O: the input-output interface; reads off the argmax logit

S_Pizza = (L, ⟦·⟧_Pizza, R_Pizza, O):
- L: the same token-pair inputs
- ⟦·⟧_Pizza: the Pizza forward pass, circular embeddings followed by midpoint-and-slice detection (i.e., the forward pass of a transformer with constant attention)
- R_Pizza: the resource profile of a linear-layer-dominant model (no multiplicative attention)
- O: the same input-output interface

Both substrates achieve 100% accuracy. Sharing the same L and O, they produce identical observations on every input: ω(⟦(a, b)⟧_Clock) = ω(⟦(a, b)⟧_Pizza) = (a + b) mod 59.

But the semantics maps and resource profiles are different objects. ⟦·⟧_Clock routes the computation through cross-token multiplication; ⟦·⟧_Pizza routes it through midpoints and absolute values.

When does this matter? If we only look at the interface O, the argmax logit, the two substrates are indistinguishable, and no behavioral evaluation at that interface can tell them apart.

The paper shows this by using a different interface O′, one that exposes internals such as attention patterns, intermediate activations, and how the logits depend on the distance between the two inputs. Under O′, Clock and Pizza models produce clearly different observations.
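Reusing clock_logits, pizza_logits, and P from the idealized sketches above, the two-interface point can be checked directly (the numbers come from those caricature formulas, not from trained models):

```python
import numpy as np  # assumes clock_logits, pizza_logits, P defined as above

# Interface O: argmax logits. The two substrates agree on every input.
print(all(np.argmax(clock_logits(a, b)) == np.argmax(pizza_logits(a, b))
          for a in range(P) for b in range(P)))             # True

# Interface O': logit magnitudes. Pizza's top logit shrinks as the inputs
# move apart on the circle; Clock's does not.
print(clock_logits(0, 1).max(), clock_logits(0, 29).max())  # 1.0, 1.0
print(pizza_logits(0, 1).max(), pizza_logits(0, 29).max())  # ~0.999, ~0.027
```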
Distance and Morphisms
Having defined what a substrate is, we can now define two relational concepts: how far apart two substrates are, and what it means for one substrate to translate into another.
Distance Between Substrates
Given a substrate S = (L, ⟦·⟧, R, O), define its realized behavior set

Beh(S) = { ⟦p⟧ : p ∈ L } ⊆ B.

This is the set of abstract behaviors S can actually realize with its language and resources.

We can then define a similarity measure between two substrates by comparing their realized behavior sets. Writing Beh(S₁) and Beh(S₂) for the two sets, one natural choice (for finite sets) is the overlap

sim(S₁, S₂) = |Beh(S₁) ∩ Beh(S₂)| / |Beh(S₁)|,

with distance d(S₁, S₂) = 1 − sim(S₁, S₂).

When sim(S₁, S₂) = 1, every behavior S₁ realizes is also realized by S₂; when it is 0, the two substrates share no realized behaviors.

Intersection of the realized behavior sets of two substrates
A concrete example: let S₁ be the Python substrate and S₂ the C++ substrate from the appendix. The abstract sorting behavior lies in Beh(S₁) ∩ Beh(S₂); a behavior like “sort a very large array within a fixed time budget” may lie in Beh(S₂) but outside Beh(S₁), because the resource profiles differ.

The distance quantifies the capability gap: how much of what one substrate can realize the other cannot.

In the Clock/Pizza example, if we take B to be the space of input-output functions, both substrates realize exactly (a + b) mod 59, so their distance is zero. If we enrich B to include internal computational structure, the Clock and Pizza behaviors become distinct elements and the distance is no longer zero.
One important subtlety is that this similarity is not a metric in the strict mathematical sense, because it does not always satisfy the triangle inequality. Two substrates can each be equally similar to a third substrate without being similar to each other. That is not a bug; it captures the idea that two models can each be equally close to a reference model in capability, while still being very different from one another. We present this as a proposed definition, and whether a true metric can be built from it is still open.
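A minimal sketch over finite behavior sets (the behavior labels are invented; the overlap formula is the proposed definition above):

```python
def similarity(beh1: set, beh2: set) -> float:
    """Fraction of the first substrate's realized behaviors the second also realizes."""
    return len(beh1 & beh2) / len(beh1)

def distance(beh1: set, beh2: set) -> float:
    return 1.0 - similarity(beh1, beh2)

# Two substrates can be equally close to a reference yet far from each other.
ref = {"add_mod_59", "sort", "search"}
s1 = {"add_mod_59", "sort"}
s2 = {"add_mod_59", "search"}
print(distance(s1, ref), distance(s2, ref))  # 0.0 0.0: both sit inside the reference
print(distance(s1, s2))                      # 0.5: yet they barely overlap
print(distance(ref, s2))                     # ~0.33, so the triangle inequality fails
```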
Conclusion
The formalism introduced here is meant to do one thing: give us precise vocabulary for the layer of computational implementation. A substrate is a language, a semantics map, a resource profile, and an observable interface. Keeping the four components separate lets us say exactly where two deployments of the “same” model differ, and which of those differences a given evaluation interface can and cannot see.
Appendix - A Worked Sorting Example
Let the abstract behavior be sorting: given a finite list of numbers, return the same numbers in ascending order. Two substrates realize it.
Substrate S_Py:
- L: valid Python programs
- ⟦·⟧: the CPython interpreter (maps code to the function it computes)
- R: CPU execution, float64, limited parallelism (GIL), no vectorization
- O: standard I/O; reads output from stdout

Substrate S_Cpp:
- L: valid C++17 programs
- ⟦·⟧: a compiled binary (maps code to the function it computes)
- R: native execution, SIMD available, no interpreter overhead
- O: same I/O; same observations
A Quicksort implementation in Python and a Quicksort implementation in C++ yield the same abstract behavior: every input list maps to the same sorted output, so through O the two are indistinguishable.
Now replace Quicksort with Mergesort on the same two substrates. The abstract behavior is unchanged, but the algorithms interact with the resource profiles differently.
Quicksort sorts in place. It partitions around a pivot, recurses on subarrays, and accesses memory in a data-dependent pattern. This works well on small inputs that fit in cache, but on large inputs it causes more cache misses because access is irregular. Its recursion depth is O(log n) on average, degrading toward O(n) with unlucky pivots.
Mergesort, by contrast, accesses memory sequentially during merging. That is exactly what hardware prefetchers are good at, so it often hides memory latency better on large inputs. The tradeoff is extra space: it needs O(n) auxiliary memory for its merge buffers.
So although both algorithms have the same semantics in B (they realize the same sorting function), which one is preferable depends on the resource profile R, and nothing at the interface O reveals the difference.
That is exactly what the 4-tuple is meant to capture. A framework that only tracks input-output behavior has no place to put this difference; the substrate formalism locates it in R, with ⟦·⟧ and O held fixed.
For sorting, the realized behavior sets satisfy sort ∈ Beh(S_Py) ∩ Beh(S_Cpp): both substrates realize the abstract sorting behavior, so at the level of input-output functions their distance is zero. Everything that distinguishes them lives in R.
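A small Python sketch of the interface-level claim (both implementations are deliberately simple; through O, the returned list, they cannot be told apart):

```python
import random

def quicksort(xs: list) -> list:
    """Pivot-partition-recurse; written functionally here for brevity."""
    if len(xs) <= 1:
        return xs
    pivot = xs[len(xs) // 2]
    return (quicksort([x for x in xs if x < pivot])
            + [x for x in xs if x == pivot]
            + quicksort([x for x in xs if x > pivot]))

def mergesort(xs: list) -> list:
    """Split, sort halves, merge sequentially; the merge needs O(n) extra space."""
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    left, right = mergesort(xs[:mid]), mergesort(xs[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

# Same observations at the interface, different resource behavior underneath.
data = [random.randint(0, 999) for _ in range(10_000)]
assert quicksort(data) == mergesort(data) == sorted(data)
```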