I spent some time trying to understand attention mechanisms and kept running into the same problem — tutorials would show the architecture diagram, briefly touch on the math, and move on. So I worked through the numbers myself. This is what I came away with.
The Bottleneck That Started It All
Seq2Seq models are designed to map sequences of one length to sequences of another. A classic example is machine translation: the English phrase “I eat rice” (3 words) translates to the French “Je mange du riz” (4 words). The input and output can be any length, and they don’t have to match.
Before Seq2Seq, standard recurrent neural networks (RNNs) and Long Short-Term Memory (LSTM) networks struggled with many sequence modelling tasks for a simple reason: they couldn’t handle inputs and outputs of different lengths.

Seq2Seq solved this by using two RNNs: an encoder and a decoder. The encoder reads the entire input sequence and compresses its meaning into a single, fixed-length context vector — the encoder’s final hidden state. This vector is then passed to the decoder, which unfolds it to generate the output sequence one word at a time.
You need two RNNs because a single RNN produces one output per input step. If the input is 3 words but the translation is 5, a single RNN has no way to produce those extra words.

The problem is that single context vector. Whether your sentence is 5 words or 500, the hidden state is the same size (say, 256 numbers). For short sentences, that’s plenty. But for long sentences, you’re stuffing more and more information into the same fixed space. You’d end up losing a lot of detail. And because the RNN processes sequentially, information from early words gets repeatedly transformed and mixed with later words — by word 100, the details of word 1 have been overwritten.
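To see the bottleneck concretely, here is a tiny sketch (a GRU with a made-up hidden size of 8 and random inputs, purely illustrative): whether the encoder reads 3 steps or 300, its final hidden state is exactly the same size.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.GRU(input_size=4, hidden_size=8, batch_first=True)

short_input = torch.randn(1, 3, 4)    # a 3-"word" sequence
long_input = torch.randn(1, 300, 4)   # a 300-"word" sequence

_, h_short = rnn(short_input)
_, h_long = rnn(long_input)

# Both final hidden states have shape (1, 1, 8): the summary
# is the same size no matter how much flowed through it.
print(h_short.shape)  # torch.Size([1, 1, 8])
print(h_long.shape)   # torch.Size([1, 1, 8])
```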
Bahdanau’s attention (2015) changed this.
Instead of forcing all information through a single vector, the decoder gets to look back at the hidden states from every step of the encoder. At each step of generating an output, the decoder creates a dynamic context vector by assigning relevance scores to all the input words and taking a weighted sum. So instead of one pre-baked summary, it builds a custom summary at each decoding step, weighted toward whichever input words are most relevant to the word it’s currently trying to produce.
Luong’s attention came after, introducing “global” and “local” variants with simpler alignment scoring functions. Both mechanisms address the same information bottleneck: a single fixed vector cannot carry the meaning of a long sequence.
Three Components of Attention
According to the book Deep Learning, an attention-based system consists of 3 components:
- A process that “reads” raw data (such as source words in a source sentence), and converts them into distributed representations, with one feature vector associated with each word position. The encoder takes in each input word and produces a hidden state vector that captures its meaning in context. In the code later, these are S_0, S_1, S_2.
- A list of feature vectors storing the output of the reader. This can be understood as a “memory”. It contains a sequence of facts, which can be retrieved later, not necessarily in the same order, without having to visit all of them. Unlike basic RNNs where information flows strictly left-to-right, attention can reach into any position of this memory at any time. In the code, this is S — the matrix of all encoder hidden states.
- A process that “exploits” the content of the memory to sequentially perform a task, at each time step having the ability to put attention on the content of one memory element (or a few, with a different weight). At each step, the decoder decides which memory elements are most relevant to the current subtask and assigns different weights to them.
The context vector is the result of this workflow: compute alignment scores → turn them into attention weights with softmax → take the weighted sum of the encoder hidden states.
It gets recalculated at every decoder time step. A dynamic summary of relevant input, rebuilt fresh each time the decoder needs to produce the next word. We’ll see this flow in code later.
Attention lets the decoder focus on relevant information regardless of where it appears in the sequence. Earlier sequence models couldn’t do this. Information degraded over many processing steps — by the time the RNN reached the end of a long sentence, details from the beginning had been overwritten.
Before attention, the encoder decided what mattered by compressing everything into one vector. The decoder had no say. Attention splits this into two learnable steps:
- Figure out what’s relevant (the attention weights)
- Then, figure out what to do with it (the word prediction)
The “what’s relevant” part gets recalculated fresh at every single decoding step. The memory just exists, and the decoder chooses what to pull from it each time.
Luong’s Three Scoring Functions
In the paper, Luong defines h_t as the decoder’s hidden state at time step t and h_s as the encoder’s hidden state at position s.
There are 3 ways to compute the score between them:
- Dot (Dot Product) — simple dot product between decoder and encoder hidden states. Fast, no parameters.
- General — introduces a learnable matrix W_a which allows for a linear transformation of the encoder states before comparison.
- Concat — concatenates h_t and h_s then passes through a learned linear layer W_a and non-linearity. v_a is a learned vector projecting to a scalar score. Most expressive but computationally heavier.
The worked example below uses the dot product — it’s the simplest to trace by hand.
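As a rough numpy sketch of the three scoring functions, with made-up vectors h_t and h_s, and randomly initialized W_a and v_a standing in for learned parameters (the dimensions here are my choice for illustration, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
h_t = rng.standard_normal(d)   # decoder hidden state at step t
h_s = rng.standard_normal(d)   # encoder hidden state at position s

# dot: plain dot product, no parameters at all
score_dot = h_t @ h_s

# general: a learned matrix W_a transforms h_s before comparison
W_a = rng.standard_normal((d, d))
score_general = h_t @ (W_a @ h_s)

# concat: concatenate h_t and h_s, project through W_a and tanh,
# then reduce to a scalar with a learned vector v_a
W_a_concat = rng.standard_normal((d, 2 * d))
v_a = rng.standard_normal(d)
score_concat = v_a @ np.tanh(W_a_concat @ np.concatenate([h_t, h_s]))
```

All three produce a single scalar score per encoder position; they differ only in how many learned parameters sit between the two vectors.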
Computing the Context Vector
Computing the context vector is a 5-step process:
- Prepare inputs
- Compute alignment scores using dot product
- Compute attention weights via softmax
- Multiply each vector by its softmaxed score
- Sum up to output the context vector
Step 1: Prepare Inputs
Let’s say we have three encoder hidden states and one decoder hidden state:
# the 3 encoder hidden states
S_0 = [0.3, 0.11, 0.9, 0.5]
S_1 = [0.8, 0.3, 0.7, 0.1]
S_2 = [0.5, 0.3, 0.4, 0.8]
# the decoder hidden state
T_1 = [0.2, 0.7, 0.9, 0.3]
Step 2: Compute Alignment Scores
The dot product of each encoder hidden state with the decoder hidden state:
S_0 · T_1 = (0.3)(0.2) + (0.11)(0.7) + (0.9)(0.9) + (0.5)(0.3) = 1.097
S_1 · T_1 = (0.8)(0.2) + (0.3)(0.7) + (0.7)(0.9) + (0.1)(0.3) = 1.030
S_2 · T_1 = (0.5)(0.2) + (0.3)(0.7) + (0.4)(0.9) + (0.8)(0.3) = 0.910
These alignment scores tell us how relevant each encoder position is to what the decoder needs right now.
Step 3: Compute Attention Weights Via Softmax
Exponentiate, then normalize so they sum to 1:
exp(1.097) = 2.995
exp(1.030) = 2.801
exp(0.910) = 2.484
total = 8.280
attn_weights = [0.362, 0.338, 0.300]
S_0 gets attention weight of 36.2%, S_1 gets 33.8% and S_2 gets 30.0%.
Steps 4 and 5: Multiply Each Vector To Get Weighted Sum
0.362 × [0.3, 0.11, 0.9, 0.5] = [0.109, 0.040, 0.326, 0.181]
0.338 × [0.8, 0.3, 0.7, 0.1] = [0.271, 0.101, 0.237, 0.034]
0.300 × [0.5, 0.3, 0.4, 0.8] = [0.150, 0.090, 0.120, 0.240]
Context vector c₁ = [0.529, 0.231, 0.682, 0.455]
This context vector is a dynamic summary of relevant input — it gets recalculated at every decoder time step. At the next step the decoder will have a different hidden state, the alignment scores will shift, and a different context vector will come out.
The whole thing in code:
import numpy as np
S_0 = [0.3, 0.11, 0.9, 0.5]
S_1 = [0.8, 0.3, 0.7, 0.1]
S_2 = [0.5, 0.3, 0.4, 0.8]
S = np.array([S_0, S_1, S_2])
T1 = np.array([0.2, 0.7, 0.9, 0.3])
e = np.dot(S, T1) # alignment scores
exp_e = np.exp(e)
attn_weights = exp_e / np.sum(exp_e) # attention weights
context_vector = np.sum(attn_weights[:, np.newaxis] * S, axis=0) # context vector
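As a sanity check, recomputing the same steps and printing the intermediate values reproduces the hand-computed numbers, up to rounding:

```python
import numpy as np

S = np.array([[0.3, 0.11, 0.9, 0.5],
              [0.8, 0.3, 0.7, 0.1],
              [0.5, 0.3, 0.4, 0.8]])
T1 = np.array([0.2, 0.7, 0.9, 0.3])

e = S @ T1                                   # alignment scores
attn_weights = np.exp(e) / np.sum(np.exp(e)) # softmax
context_vector = attn_weights @ S            # weighted sum of rows

print(e.round(3))               # ≈ [1.097, 1.03, 0.91]
print(attn_weights.round(3))    # ≈ [0.362, 0.338, 0.3]
print(context_vector.round(3))  # ≈ [0.529, 0.231, 0.682, 0.455]
```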
The Two Softmaxes
There are two softmaxes in the full pipeline and they answer different questions:
- First softmax is for the attention weights: “Which parts of the input sentence should I look at right now?”
- Second softmax is for word prediction: “Now that I’ve looked, which word should I actually say next?”
Once we have the context vector c_1, we combine it with the decoder hidden state and pass the result through another layer to generate the final prediction:

h_tilde_1 = tanh(W_c · [c_1; T_1])

Then:

probs = softmax(W_s · h_tilde_1)

Where h_tilde is the attentional hidden state, W_c is a learned weight matrix combining context and decoder state, and W_s maps the attentional hidden state into a vocabulary-sized vector. That vector is softmaxed to become a probability distribution over all possible words in the target language.
In code, with a toy vocabulary of 5 words:
combined = np.concatenate([context_vector, T1]) # Shape: (8,)
# W_c: learned weight matrix (random here for demo)
hidden_size = 6
np.random.seed(42)
W_c = np.random.randn(hidden_size, combined.shape[0]) # Shape: (6, 8)
h_tilde = np.tanh(np.dot(W_c, combined)) # attentional hidden state
# W_s: maps to vocabulary logits
vocab_size = 5
W_s = np.random.randn(vocab_size, hidden_size) # Shape: (5, 6)
logits = np.dot(W_s, h_tilde) # raw scores
def softmax(x):
    exp_x = np.exp(x - np.max(x))  # subtract max for numerical stability
    return exp_x / np.sum(exp_x)
probs = softmax(logits)
# Word 0: 73.9%
# Word 1: 8.5%
# Word 2: 7.7%
# Word 3: 0.8%
# Word 4: 9.2%
The model picks Word 0 with 73.9% confidence. In a real system, that maps to an actual word in the target language. The weight matrices W_c and W_s here are random — in a trained model they’re learned through gradient descent.
What Real Learned Attention Looks Like
All the numbers above were hand-picked to show the math. But what do S_0, S_1, S_2, and T_1 actually look like when they come out of a trained model?
Below is a small encoder-decoder RNN with attention, trained on English-to-French translation. The mechanism is the same as everything above (the dot product, the softmax, the weighted sum), except that now the vectors are the output of actual training, not numbers I typed in.
import torch
import torch.nn as nn
import torch.optim as optim
training_pairs = [
    ("the cat sat", "le chat assis"),
    ("the dog ran", "le chien couru"),
    ("a cat sat", "un chat assis"),
    ("a dog ran", "un chien couru"),
    ("the cat ran", "le chat couru"),
    ("the dog sat", "le chien assis"),
    ("a cat ran", "un chat couru"),
    ("a dog sat", "un chien assis"),
]
SOS_TOKEN = 0
EOS_TOKEN = 1
def build_vocab(sentences):
    vocab = {"<SOS>": 0, "<EOS>": 1}
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab
eng_vocab = build_vocab([p[0] for p in training_pairs])
fra_vocab = build_vocab([p[1] for p in training_pairs])
fra_vocab_inv = {v: k for k, v in fra_vocab.items()}
def sentence_to_tensor(sentence, vocab):
    return [vocab[w] for w in sentence.split()] + [EOS_TOKEN]
X = [sentence_to_tensor(p[0], eng_vocab) for p in training_pairs]
Y = [sentence_to_tensor(p[1], fra_vocab) for p in training_pairs]
max_len_x = max(len(s) for s in X)
max_len_y = max(len(s) for s in Y)
# pad shorter sequences with EOS so each batch fits in one tensor
X = torch.LongTensor([s + [EOS_TOKEN] * (max_len_x - len(s)) for s in X])
Y = torch.LongTensor([s + [EOS_TOKEN] * (max_len_y - len(s)) for s in Y])
HIDDEN_SIZE = 32
class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, x):
        embedded = self.embedding(x)
        encoder_outputs, hidden = self.rnn(embedded)
        return encoder_outputs, hidden

class AttentionDecoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.fc_out = nn.Linear(hidden_size * 2, vocab_size)

    def forward(self, decoder_input, decoder_hidden, encoder_outputs):
        embedded = self.embedding(decoder_input)
        decoder_output, decoder_hidden = self.rnn(embedded, decoder_hidden)
        # dot-product alignment scores between every encoder state
        # and the current decoder state
        scores = torch.bmm(encoder_outputs, decoder_output.transpose(1, 2))
        attn_weights = torch.softmax(scores, dim=1)  # the first softmax
        # weighted sum of encoder states = the context vector
        context = torch.bmm(attn_weights.transpose(1, 2), encoder_outputs)
        combined = torch.cat([decoder_output, context], dim=2)
        prediction = self.fc_out(combined.squeeze(1))  # vocabulary logits
        return prediction, decoder_hidden, attn_weights.squeeze(2)
encoder = Encoder(len(eng_vocab), HIDDEN_SIZE)
decoder = AttentionDecoder(len(fra_vocab), HIDDEN_SIZE)
criterion = nn.CrossEntropyLoss(ignore_index=EOS_TOKEN)
optimizer = optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=0.01)
for epoch in range(500):
    encoder_outputs, hidden = encoder(X)
    loss = 0
    decoder_input = torch.full((len(X), 1), SOS_TOKEN, dtype=torch.long)
    for t in range(max_len_y):
        pred, hidden, attn_w = decoder(decoder_input, hidden, encoder_outputs)
        loss += criterion(pred, Y[:, t])
        decoder_input = Y[:, t].unsqueeze(1)  # teacher forcing
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 100 == 0:
        print(f"Epoch {epoch+1}/500, Loss: {loss.item():.4f}")
After training, inspect what the model actually learned:
test_sentence = "the cat sat"
test_tokens = sentence_to_tensor(test_sentence, eng_vocab)
test_tensor = torch.LongTensor([test_tokens + [EOS_TOKEN] * (max_len_x - len(test_tokens))])
input_words = test_sentence.split() + ["<EOS>"]
with torch.no_grad():
    enc_out, hidden = encoder(test_tensor)
print("ENCODER HIDDEN STATES (the real S_0, S_1, S_2)")
for i, word in enumerate(input_words):
    vals = enc_out[0, i, :8].numpy().round(3)
    print(f' S_{i} ("{word}"): {vals}…')
decoder_input = torch.LongTensor([[SOS_TOKEN]])
predicted_words = []
with torch.no_grad():
    for t in range(max_len_y):
        pred, hidden, attn_w = decoder(decoder_input, hidden, enc_out)
        predicted_id = pred.argmax(dim=1).item()
        if predicted_id == EOS_TOKEN:
            break
        predicted_word = fra_vocab_inv[predicted_id]
        predicted_words.append(predicted_word)
        weights = attn_w.squeeze().numpy()
        print(f'\nGenerating "{predicted_word}":')
        for i, word in enumerate(input_words):
            bar = "█" * int(weights[i] * 30)
            print(f' "{word:>5}" → {weights[i]:.3f} {bar}')
        decoder_input = torch.LongTensor([[predicted_id]])
print(f'\nInput: "{test_sentence}"')
print(f'Translated: "{" ".join(predicted_words)}"')
When generating “le”, the model should attend mostly to “the”. When generating “chat”, it should attend to “cat”. The attention weights make this visible — the decoder is choosing what to pull from the memory at each step, exactly like the three-component framing described earlier. The only difference from the numpy walkthrough is that these numbers came out of gradient descent instead of being typed in by hand.
What Comes Next
Everything above is attention as an add-on to RNNs — the encoder and decoder are still recurrent networks, and attention just helps the decoder look back more effectively.
The next post covers what happens when you drop the RNNs entirely and build a model out of nothing but attention.
Attention With Actual Numbers was originally published in Towards AI on Medium.