Ambitious Mech Interp w/ Tensor-transformers on toy languages [Project Proposal]

This is my project proposal for Pivotal. Apply as a mentee by May 3rd.

The field has accumulated a vocabulary of computational primitives (induction heads, skip-trigrams) through post-hoc analysis. We propose building a toy language from these known primitives and training tensor-transformers on it (see an early example in the last section).

This allows us to study fundamental problems (suppression & error correction, compositionality/circuits, dev-interp, etc) with the odds stacked in our favor:

  1. We know the data-generating process (DGP) - we know exactly what the bigram statistics, skip-trigrams, and induction patterns are, and how they interfere w/ each other.
  2. Tensor-transformers make compositionality clear-as-day (ie you can find relationships between any model components solely from the weights, whereas normal NNs require running data through them).
  3. This is a transformer on a language task - results learned here straightforwardly apply to real LLMs.
  4. Modifiable complexity - we can change the complexity of the data, number of layers, width of model, etc (in general, we can easily train a bespoke model to verify a target hypothesis).

Specific Research Directions

  1. Improve the DGP (data-generating process) - Extend the DGP to include computational patterns beyond n-grams (eg nested structure (brackets, quotes), long-range dependencies, context-sensitive transitions, etc) learned from existing datasets.
  2. Interp-across-time - are there any dependent structures during training (eg must learn X before learning Y)? (Most similar to work by Naomi Saphra)
  3. Building interp tools - What techniques (existing or novel) can be used to find these ground truth features?
  4. Phenomenon Studies — use the controlled setup to characterize specific computational phenomena (suppression, error correction, compositional reuse) with ground-truth verification.
  5. Tensor Interp - because we're using a tensor-transformer, there may be new techniques available to us (prior familiarity with tensor networks is a prerequisite for this direction)
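For the nested-structure extension in direction 1, a minimal sketch of one possible DGP component - a Dyck-1 (balanced brackets) sampler. The recursion probabilities and depth cap are arbitrary choices for illustration:

```python
import random

def sample_dyck(rng, depth=0, max_depth=4):
    """Recursively sample a balanced bracket string (Dyck-1), the
    simplest nested structure one could add to the toy language."""
    if depth >= max_depth or rng.random() < 0.3:
        return ""  # stop: emit nothing at this position
    inner = sample_dyck(rng, depth + 1, max_depth)  # nest one level deeper
    rest = sample_dyck(rng, depth, max_depth) if rng.random() < 0.5 else ""
    return "(" + inner + ")" + rest                 # wrap inner + sibling

print(sample_dyck(random.Random(1)))
```

Any string this produces is balanced by construction, so "does the model track bracket depth?" becomes a question with a known ground-truth answer.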

High Level View

I'm shooting for a healthy feedback loop of:

  1. Use existing computational vocab (eg induction) to make a toy LLM
  2. Use (1) to improve our basic knowledge of models (eg suppression) and learn new computational vocab
  3. Repeat
  4. ...
  5. Profit

If we succeed at enough loops of this process, this could work as a foundation for LLMs automating ambitious mech interp. In a sense, mech interp is already a verifiable task (ie find *simple* descriptions that replicate model behavior), but we need to resolve enough of our own confusions (& build better tools) first.

If this interests you, do apply to my (& Thomas') research stream (by May 3rd).

Current Trained Model

As an example, I’ve trained a 2-layer attn-only model. Looking at embed -> unembed: 

[Image: embed -> unembed matrix]

There’s lots of apparent structure. Zooming into the Verb_T/NOUN square, you can see the bigram statistics for:

[Image: Verb_T/NOUN block of the embed -> unembed matrix]
  • alice → sees(70%), helps(20%), finds(10%)
  • bob → knows(70%), likes(20%), meets(10%)
  • carol → calls(70%), tells(20%), sees(10%)
  • Etc
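For readers who want to reproduce this kind of plot, here's a sketch of reading bigram statistics off the weights alone. The matrices below are random stand-ins (in practice you'd load the trained model's embedding/unembedding), and the token indices are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_vocab = 16, 9

# Stand-in weights; in practice, load the trained model's matrices.
W_E = rng.normal(size=(n_vocab, d_model))   # embedding
W_U = rng.normal(size=(d_model, n_vocab))   # unembedding

# The direct path embed -> unembed: a vocab x vocab matrix whose
# (i, j) entry is the logit contribution "token i predicts token j".
direct_path = W_E @ W_U

# Zooming into a sub-block (e.g. subjects x verbs) is what recovers
# the bigram statistics above. Indices here are hypothetical.
subjects = [0, 1, 2]   # rows for alice / bob / carol
verbs = [3, 4, 5]      # cols for sees / helps / finds
block = direct_path[np.ix_(subjects, verbs)]
print(block.shape)
```

Note this needs no forward passes at all - the bigram structure is read directly from the weights, which is the tensor-transformer selling point from earlier.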

We can also look at a slice of the QK circuit:

[Image: QK circuit slice]

For the skip-bigrams (4 rules, max_skip=8):

  •   beach ... big → at
  •   garden ... old → and
  •   lake ... new → or
  •   office ... small → to

Zooming in, you can clearly see two of them here:

[Image: zoomed QK slice showing two of the skip-bigram rules]

But the other two are in the top-left & top-right boxes (they're negative, yes, but this is bilinear attn: a negative QK entry times a negative OV entry ends up positive).
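To spell out that sign argument with toy numbers (the values are entirely made up; only the signs track the claim): in the bilinear setting the QK and OV entries for a rule multiply directly, so two negative entries still produce a net positive push toward the rule's output token.

```python
# Hypothetical circuit entries for the two "hidden" rules; roughly,
# effect(key ... query -> out) ~ QK[query, key] * OV[key, out].
rules = {
    "beach ... big -> at":   {"qk": -2.0, "ov": -3.0},
    "garden ... old -> and": {"qk": -1.5, "ov": -2.0},
}
for name, c in rules.items():
    effect = c["qk"] * c["ov"]  # bilinear: the signs multiply
    assert effect > 0           # two negatives give a positive effect
    print(name, "->", effect)
```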


