The previous post walked through attention as an add-on to RNNs — a way to let the decoder look back at encoder hidden states instead of relying on one compressed vector. This post covers what happened when that add-on became the entire model.
In 2017, the paper “Attention Is All You Need” introduced the Transformer. It had both an encoder and a decoder, just like the RNN-based Seq2Seq models before it, but with the recurrence stripped out entirely. Self-attention replaced the sequential processing. Everything ran in parallel.
Yet the field quickly split the architecture in half. BERT used only the encoder, while GPT used only the decoder. Most major models since then pick one half and build on it.
But why would we throw away half the architecture?
Writing the code out by hand showed me how encoding and decoding do different things, and why we rarely need both.
The Encoder — What “Understanding” Looks Like in Code
An encoder takes an input sentence and compresses each token into a dense numerical vector that captures its meaning based on the entire input. It’s the part of the architecture that “understands.”
It does this using multiple layers of self-attention and feed-forward networks, each enabled by learned weight matrices. The code below shows one layer.
In practice, models like BERT stack 12 of these on top of each other. The output of one layer becomes the input to the next, building increasingly abstract representations.

I’ll show the code first, then explain what’s happening.
import torch
import torch.nn as nn
import torch.nn.functional as F
# Sample input: 2 sentences, 5 tokens each, 4-dimensional embeddings
x = torch.randn(2, 5, 4) # (batch_size, seq_len, embedding_dim)
hidden_dim = 8
# Step 1: Project input into Q, K, V
W_Q = nn.Linear(4, hidden_dim, bias=False)
W_K = nn.Linear(4, hidden_dim, bias=False)
W_V = nn.Linear(4, hidden_dim, bias=False)
Q = W_Q(x) # (2, 5, 8)
K = W_K(x) # (2, 5, 8)
V = W_V(x) # (2, 5, 8)
# Step 2: Attention scores
scores = torch.matmul(Q, K.transpose(-2, -1)) / (hidden_dim ** 0.5) # (2, 5, 5)
# Step 3: Softmax
attention_weights = F.softmax(scores, dim=-1) # (2, 5, 5)
# Step 4: Weighted sum of values
attention_output = torch.matmul(attention_weights, V) # (2, 5, 8)
# Step 5: Feed-forward network
ffn = nn.Sequential(
    nn.Linear(hidden_dim, hidden_dim * 4),
    nn.ReLU(),
    nn.Linear(hidden_dim * 4, hidden_dim)
)
output = ffn(attention_output) # (2, 5, 8)
Here’s what’s going on. For each token embedding x, we multiply it by three learned weight matrices to get queries, keys, and values:
- Q = W_Q(x) — “what am I looking for?”
- K = W_K(x) — “what do I contain?”
- V = W_V(x) — “what information can I give?”
An analogy from my notes that helped: imagine every word walks into a meeting with three things.
- Q is a question it wants to ask.
- K is a description of what it knows.
- V is a bundle of information it’s willing to share.
The attention mechanism figures out whose questions match whose descriptions, then redistributes the information accordingly.
In the code, each weight matrix is created with:
nn.Linear(embedding_dim, hidden_dim, bias=False)
This is a learnable linear transformation — it implements the function Linear(x) = xWᵀ. The weight matrix starts with random values and gets updated during training through backpropagation: the model computes attention, predicts an output, measures the loss, computes gradients, and updates the weights.
Q, K, and V aren’t fixed; they improve as the model trains.
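To make the “learnable” part concrete, here’s a quick inspection of the W_Q layer from the code above (a small check; the values themselves are random until training updates them):
print(W_Q.weight.shape)          # torch.Size([8, 4]): hidden_dim x embedding_dim
print(W_Q.weight.requires_grad)  # True: gradients flow through it during backpropagation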
The attention mechanism then computes:
Attention(Q,K,V) = softmax(QKᵀ / √dₖ) × V
- The dot product QKᵀ gives alignment scores of how well each query matches each key.
- Softmax normalizes these into attention weights.
- Those weights are used to take a weighted sum of the values.
This is the same dot-product-then-softmax math from the previous post, except now every token is querying every other token in the *same* sequence. There’s no separate decoder asking questions — the sentence is talking to itself.
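One quick sanity check on the code above: because of the softmax, each row of the attention weight matrix is a probability distribution over the tokens being attended to, so every row sums to 1.
print(attention_weights.sum(dim=-1))  # all ones, shape (2, 5)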
After attention, the result passes through a feed-forward network:
FFN(x) = ReLU(xW₁ + b₁)W₂ + b₂.
This is applied to each token independently, refining the representation.
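You can verify the “independently” part directly with the tensors from the code above: pushing a single token’s vector through the FFN gives the same result as the corresponding row of the batched output.
# The FFN never mixes information across positions
single_token = ffn(attention_output[0, 2])          # token 2 of the first sentence
print(torch.allclose(single_token, output[0, 2]))   # True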
Notice the score matrix shape (batch, seq_len, seq_len). That’s a 5×5 matrix for our 5-token input, where every token scored against every other token. In the RNN attention from the previous post, the scores were a vector — one score per encoder position, from a single decoder state. Here, it’s a full matrix. Every word attending to every word, all at once.
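To make the stacking mentioned earlier concrete, here’s a minimal sketch that wraps the steps above into a module and stacks 12 copies. It’s a simplification: real encoder layers also use multi-head attention, residual connections, and layer normalization, which I’m leaving out.
class EncoderLayer(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.W_Q = nn.Linear(dim, hidden_dim, bias=False)
        self.W_K = nn.Linear(dim, hidden_dim, bias=False)
        self.W_V = nn.Linear(dim, hidden_dim, bias=False)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim * 4),
            nn.ReLU(),
            nn.Linear(hidden_dim * 4, hidden_dim),
        )
        self.hidden_dim = hidden_dim

    def forward(self, x):
        Q, K, V = self.W_Q(x), self.W_K(x), self.W_V(x)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.hidden_dim ** 0.5)
        weights = F.softmax(scores, dim=-1)
        return self.ffn(torch.matmul(weights, V))

# Stacking only works if a layer's output dim matches its input dim,
# so both are 8 here; BERT-base uses 768 dimensions and 12 layers.
layers = nn.ModuleList([EncoderLayer(8, 8) for _ in range(12)])
h = torch.randn(2, 5, 8)
for layer in layers:
    h = layer(h)  # the output of one layer becomes the input to the next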
Encoder-Only Models — When You Only Need to Understand
An encoder-only transformer uses just the encoder stack. No decoder, no generation. It takes an input sentence → processes it through multiple encoder layers → produces contextual embeddings for each token → passes those embeddings to a task-specific head.
These models are for understanding, not generating. Text classification. Sentiment analysis. Semantic similarity. They read and interpret.
The main encoder-only models (BERT, DistilBERT, RoBERTa, ALBERT) are all variations on the same idea: stack encoder layers, pretrain on a huge corpus, then fine-tune for your specific task.
Body and Head
One thing that confused me: transformer models are split into a “task-independent body” and a “task-specific head,” and that “head” has nothing to do with “multi-head attention.” They’re completely separate concepts using the same word.
The body is the stack of encoder layers. It’s pretrained on massive datasets.
BERT uses masked language modeling; other models use different pretraining objectives. Either way, the body learns general-purpose language representations that aren’t tied to any specific task.
The head is a small neural network layer (or layers) added on top. It uses the representations produced by the body and is trained for a specific goal:
- Classification head — sentiment analysis (linear layer + softmax on the [CLS] token embedding)
- Token classification head — named entity recognition (classifier on each token)
- Span prediction head — question answering
- Regression head — predicting a score
The body understands; the head decides.
In practice, this means you can take one pretrained body and swap heads for different tasks. Here’s what that looks like with the emotion detection task from my notebook — classifying tweets into six emotions using DistilBERT:
from transformers import AutoModel
import torch
# The body: pretrained DistilBERT
model = AutoModel.from_pretrained("distilbert-base-uncased")
# Extract the [CLS] token embedding as the sentence representation
def extract_hidden_states(batch):
    input_ids = torch.tensor(batch["input_ids"])
    attention_mask = torch.tensor(batch["attention_mask"])
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    # The [CLS] token is at position 0
    cls_embeddings = outputs.last_hidden_state[:, 0, :].cpu().numpy()
    return {"features": cls_embeddings}
That’s the body doing its job of taking raw text and producing contextual embeddings. The head can be as simple as a logistic regression classifier trained on those embeddings, as in the sketch below (the names X_train, y_train, X_valid, y_valid are placeholders for the extracted features and labels):
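from sklearn.linear_model import LogisticRegression
# The head: a plain classifier trained on the frozen [CLS] embeddings
clf = LogisticRegression(max_iter=3000)
clf.fit(X_train, y_train)
print(clf.score(X_valid, y_valid))  # accuracy on held-out examples
Or you can use Hugging Face’s built-in approach: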
from transformers import AutoModelForSequenceClassification
# Body + head in one line: DistilBERT with a 6-class classification head
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=6)
Pretrained encoder body, classification head on top. Fine-tune on your labeled data and you have an emotion detector. The body was trained on generic text — it has no idea what emotions are — but the representations it learned transfer well enough that a small head can pick them up.
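As for the fine-tuning step, the usual route is Hugging Face’s Trainer. A minimal sketch, assuming a tokenized dataset called emotions_encoded with train and validation splits (that name and the hyperparameters are placeholders):
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir="distilbert-emotion",
    num_train_epochs=2,
    per_device_train_batch_size=16,
)
trainer = Trainer(
    model=model,                                   # the body + head from above
    args=training_args,
    train_dataset=emotions_encoded["train"],       # placeholder dataset name
    eval_dataset=emotions_encoded["validation"],
)
trainer.train()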
The Decoder — What “Generating” Looks Like in Code
A decoder is a model component that generates outputs, often one token at a time. Unlike encoders, which see the entire input at once, decoders are designed to attend only to past tokens during generation, using masked self-attention.
Encoders understand while decoders generate.
The code looks almost identical to the encoder, with one critical difference:
import torch
import torch.nn as nn
import torch.nn.functional as F
x = torch.randn(2, 5, 4) # same fake input
embedding_dim = 4
hidden_dim = 8
# Step 1: Same Q, K, V projections
W_Q = nn.Linear(embedding_dim, hidden_dim, bias=False)
W_K = nn.Linear(embedding_dim, hidden_dim, bias=False)
W_V = nn.Linear(embedding_dim, hidden_dim, bias=False)
Q = W_Q(x)
K = W_K(x)
V = W_V(x)
# Step 2: Attention scores - same as encoder so far
scores = torch.matmul(Q, K.transpose(-2, -1)) / (hidden_dim ** 0.5)
# THIS IS THE DIFFERENCE: causal mask
seq_len = scores.size(-1)
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores.masked_fill_(mask, float('-inf'))
# After masking, the scores matrix looks like:
# [ 0.2, -inf, -inf, -inf, -inf ] ← token 0 can only see itself
# [ 0.5, 0.3, -inf, -inf, -inf ] ← token 1 sees tokens 0–1
# [ 0.1, 0.4, 0.7, -inf, -inf ] ← token 2 sees tokens 0–2
# [ 0.3, 0.2, 0.5, 0.8, -inf ] ← token 3 sees tokens 0–3
# [ 0.4, 0.1, 0.3, 0.6, 0.9 ] ← token 4 sees everything
# Step 3–4: Same softmax and weighted sum
attention_weights = F.softmax(scores, dim=-1)
attention_output = torch.matmul(attention_weights, V)
# Step 5: Same FFN
ffn = nn.Sequential(
    nn.Linear(hidden_dim, hidden_dim * 4),
    nn.ReLU(),
    nn.Linear(hidden_dim * 4, hidden_dim)
)
output = ffn(attention_output)
# Step 6: Project to vocabulary for token prediction
vocab_size = 10000
output_layer = nn.Linear(hidden_dim, vocab_size)
logits = output_layer(output) # (batch_size, seq_len, vocab_size)
Everything is the same as the encoder except for those three lines in the middle. The causal mask sets the upper triangle of the attention score matrix to negative infinity. After softmax, negative infinity becomes zero. Token 2 cannot attend to tokens 3 or 4 — they don’t exist yet.
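You can see the mask take effect in the weights from the code above: after softmax, the masked positions are exactly zero, so the first token puts all of its weight on itself.
print(attention_weights[0, 0])  # tensor([1., 0., 0., 0., 0.])
print(attention_weights[0, 2])  # nonzero weights only on positions 0-2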
This is what makes autoregressive generation work. At inference time, the decoder operates in a loop:
- Take the tokens generated so far
- Pass them through the decoder
- Get logits for the next position → pick a token
- Append that token to the sequence
- Repeat until you hit an end-of-sequence token
The causal mask ensures that at every step, the model only conditions on past tokens. It can’t cheat by looking ahead.
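In code, that loop can be as simple as the sketch below. Here decoder is a stand-in for the masked-attention stack plus vocabulary projection above, eos_id marks end-of-sequence, and the greedy argmax is a simplification (real models sample with temperature, top-k, and so on).
def generate(decoder, input_ids, eos_id, max_new_tokens=20):
    tokens = input_ids                      # (1, seq_len): the prompt so far
    for _ in range(max_new_tokens):
        logits = decoder(tokens)            # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # pick the top token
        tokens = torch.cat([tokens, next_id], dim=1)             # append it
        if next_id.item() == eos_id:        # stop at end-of-sequence
            break
    return tokens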
Some notes on the input embeddings in the decoder. In the code above, x = torch.randn(2, 5, 4) is a sample input — 2 sequences of 5 tokens, each token a 4-dimensional embedding. In real decoder-only models like GPT, these embeddings come from the decoder’s own embedding layer. Each token (e.g., “The”, “cat”) is first converted into an integer token ID, which is then mapped to a vector via a learned embedding matrix E: x_i = E[token_id_i].
The decoder has its own embedding layer, so it isn’t relying on a separate encoder. This is different from an encoder-decoder setup, where the decoder also receives hidden states from the encoder.
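A tiny sketch of that lookup (the sizes match the toy example above, and the token IDs are made up):
vocab_size, embedding_dim = 10000, 4
E = nn.Embedding(vocab_size, embedding_dim)       # learned embedding matrix
token_ids = torch.tensor([[12, 845, 3, 9, 77]])   # five made-up token IDs
x = E(token_ids)  # (1, 5, 4): this becomes the decoder's input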
The Q, K, V projections work the same way as in the encoder. Each weight matrix (W_Q, W_K, W_V) is an nn.Linear layer — a learnable linear transformation initialized with random values and updated during training. If a sentence has 5 tokens, we get 5 query vectors, 5 key vectors, and 5 value vectors. The same meeting analogy applies: Q is the question each token asks, K is what it knows, V is the information it shares. The difference is just the mask — in the decoder, each token can only attend to tokens that came before it.
Why Not Both?
The original transformer had both halves. Encoder reads the source sentence, decoder generates the target. For translation, you need both — you need to understand French before you can produce English.
But most tasks don’t need both.
Classification? You just need to understand the input. Encoder-only. BERT. Text generation? You just need to produce tokens one at a time. Decoder-only. GPT. Summarization? You need to read a long document and generate a shorter one. You could argue this needs both — and models like T5 and BART do use the full encoder-decoder architecture.
It comes down to what the attention mechanism is allowed to see.
- Encoders let every token see every other token — bidirectional.
- Decoders mask out future tokens — causal.
You can’t easily do both in one model without compromising one or the other.
Closing
The thing that stuck with me after working through all this code is that the encoder and decoder are almost the same. Same Q/K/V projections. Same dot product. Same softmax. Same feed-forward network.
The only mechanical difference is three lines of masking code. But that small difference of whether a token can see the future or not is enough to split the entire field into two families of models.
Encoders understand. Decoders generate. And the mask is what decides which one we’re building.