The Evolution of the Attention Mechanism in Encoder-Only Models: From BERT to ModernBERT

The AI world is currently obsessed with models that talk. GPT-4, Claude 3, and Llama have become the charismatic orators of our era, generating essays, code, and poetry one word at a time. But in the shadow of these generative giants, a different kind of revolution has been quietly dominating the enterprise: the models that listen.
When you need an AI to read a 10,000-page legal corpus and instantly retrieve the exact clause you need, or when you need to perfectly classify the semantic intent of a complex query, you don’t use a talker. You use a listener. You use an Encoder.
Historically championed by BERT and recently revitalized by long-context marvels like ModernBERT, encoder-only models are making a massive comeback. But how do they actually work? What happens to a word when it enters the high-dimensional void of a transformer?
In this story, we are going to strip away the Python code and the API wrappers. We are going to take a journey into the mathematical heart of the Encoder. We will watch how words find their meaning through geometric handshakes, how they learn the concept of time and order, how they hit the brutal physical limits of modern microchips, and how researchers broke those limits to build the fastest, deepest understanding engines in the world.
Part 1: The Geometry of Meaning
Imagine reading a book, but a piece of paper covers every word after the one you are currently reading. You must guess the meaning of a sentence by only looking at the past. This is how generative models (Decoders) work. They use causal masking because their job is to predict the future.
Encoders, however, read the whole page at once. When an encoder reads the word “bank” in the sentence “The river bank was muddy”, it doesn’t just look backward at “The river”; it looks forward at “was muddy”. This is Bidirectional Attention. Every single word is allowed to look at every other word, creating a deeply interconnected web of meaning.
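To make the contrast concrete, here is a tiny illustrative sketch, in plain NumPy and not tied to any particular model, of the two masking regimes: the causal mask a decoder uses to hide the future, versus the all-ones pattern an encoder effectively uses.

```python
import numpy as np

T = 5  # a toy sequence of five tokens

# Decoder-style causal mask: token i may only attend to positions <= i.
causal_mask = np.tril(np.ones((T, T), dtype=bool))

# Encoder-style bidirectional attention: every token may attend to every token.
bidirectional_mask = np.ones((T, T), dtype=bool)

print(causal_mask.astype(int))         # lower-triangular: the future is hidden
print(bidirectional_mask.astype(int))  # all ones: the whole page at once
```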
But “looking at” is a human concept. To a computer, words are just points floating in a high-dimensional space (embeddings). How does one point “look” at another?
The Handshake: Queries, Keys, and Values
When a token enters the attention mechanism, it splits its identity into three distinct vectors — three roles it must play in the cocktail party of the sentence.
- The Query (Q): This is the token asking a question. “I am the word ‘bank’. I am ambiguous. I need financial words or nature words to figure out what I mean.”
- The Key (K): This is the token wearing a nametag. “I am the word ‘river’. I represent nature, water, and geography.”
- The Value (V): This is the actual substance of the token. The raw, semantic payload it will hand over if a match is made.
To find out which words should share information, the Transformer performs a geometric handshake: the Dot Product.
It takes the Query vector of “bank” and computes the dot product with the Key vector of “river”. In linear algebra, a dot product measures alignment. If the vectors point in the same direction, the result is a massive positive number. If they are unrelated, it’s near zero.

The model computes this handshake between every single word and every other word, creating a massive grid of compatibility scores. It then passes these scores through a Softmax function, which turns them into percentages (probabilities that sum to 100%). Finally, “bank” absorbs a percentage of the Value (V) from “river” based on that score.
“Bank” is no longer just a static word. It has absorbed the essence of “river”. It has become contextualized.
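Here is a minimal NumPy sketch of that handshake, using toy random embeddings and projection matrices rather than anything learned. It already includes the division by the square root of the dimension, which the next section explains.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d = 6, 64                        # toy sequence length and head dimension
X = rng.normal(size=(T, d))         # one embedding per token

# Each token projects itself into its three roles.
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d)       # the grid of handshakes (the scaling is explained below)
weights = softmax(scores, axis=-1)  # each row sums to 1: how a token splits its attention
contextualized = weights @ V        # each token absorbs a weighted mix of Values
print(weights.round(2))
```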
The Variance Explosion
But there is a hidden mathematical trap here. Let’s say our vectors have 64 dimensions. If you take two vectors of random noise (mean 0, variance 1) and calculate their dot product, the math dictates that the resulting value will have a mean of 0, but a variance of 64.
As we build larger models with hundreds of dimensions, these raw dot product scores become wild and massive. When you feed massive numbers into a Softmax function, it panics. It pushes the largest number to exactly 100% and everything else to 0%. The gradients flatline. The model completely stops learning — a phenomenon known as the Vanishing Gradient Problem.
To save the network, the creators of the Transformer introduced a brilliantly simple fix: Scaled Dot-Product Attention. Before passing the scores to Softmax, they divide every score by the square root of the number of dimensions (√64 = 8).
By dividing by 8, they scale the variance back down to 1. The Softmax function breathes a sigh of relief, the probabilities become smooth and well distributed, and the model keeps learning.
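You can verify the variance argument numerically with a few lines of NumPy; the numbers below come from a quick random simulation, nothing model-specific.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
q = rng.normal(size=(100_000, d))   # 100k random queries, mean 0, variance 1
k = rng.normal(size=(100_000, d))   # 100k random keys,    mean 0, variance 1

raw = (q * k).sum(axis=1)           # raw dot products
scaled = raw / np.sqrt(d)           # scaled dot products

print(raw.var())      # ~64: the variance explosion
print(scaled.var())   # ~1:  back under control
```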
Part 2: The Curse of Amnesia and the Evolution of Position
We now have a beautiful system where words share meaning based on context. But we have a fatal flaw.
The dot product has no notion of order. To the attention mechanism, “The dog bit the man” and “The man bit the dog” produce exactly the same set of dot products, just shuffled around. The model has severe amnesia; it has no concept of time, space, or sequence order. Over the years, researchers have tried various ways to cure this amnesia, fundamentally altering the architecture of the Encoder.
Era 1: Absolute Position (BERT) — The Blunt Instrument
In the original BERT, researchers used Absolute Positional Embeddings. They literally created a unique, learned vector for “Position 1”, another for “Position 2”, and simply added them to the word vectors before the math started.

It worked, but it was a blunt instrument. Because BERT was trained on sequences of 512 tokens, handing it a 513th token broke it outright: it had never learned what “Position 513” looked like. Furthermore, it forced the attention mechanism to waste precious computational power reverse-engineering relative distances (e.g., figuring out mathematically that word 5 and word 6 are next to each other).
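As a rough sketch of the idea (with made-up sizes and token IDs, not BERT's real weights), absolute positional embeddings are nothing more than a second lookup table added on top of the word embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_positions, d = 30_000, 512, 768

token_embeddings = rng.normal(size=(vocab_size, d))        # "who" table (learned in practice)
position_embeddings = rng.normal(size=(max_positions, d))  # "where" table: one vector per slot

token_ids = np.array([101, 1996, 2314, 2924, 102])         # a toy five-token input
positions = np.arange(len(token_ids))

# BERT simply adds "who" and "where" together before any attention happens.
x = token_embeddings[token_ids] + position_embeddings[positions]

# The table only has rows 0..511, so a 513th token (index 512) has nowhere to look up:
# position_embeddings[512]  -> IndexError
```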
Era 2: Disentangled Attention (DeBERTa) — The Separation of Concerns
Microsoft researchers (He et al., 2021) realized something profound: the meaning of a word (“Who”) and the position of a word (“Where”) are entirely different concepts. Adding them together in a single vector muddies the waters.
They created DeBERTa and its “Disentangled Attention”. Instead of adding the position to the word, they kept them separate and computed four separate dot products for every handshake:
- Content-to-Content: Does the word “river” relate to the word “bank”?
- Content-to-Position: Does the word “river” care about what is in Position 5?
- Position-to-Content: Does the word in Position 4 care about the word “bank”?
- Position-to-Position: (Usually ignored for efficiency).

By separating the “Who” from the “Where”, DeBERTa achieved state-of-the-art dominance in Natural Language Understanding (NLU), excelling at complex grammatical reasoning tests.
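A heavily simplified sketch of that decomposition, using toy NumPy tensors and a clipped relative-distance table rather than DeBERTa's real parameters, looks something like this (the actual model has its own projection matrices for the position table and, as noted above, skips the position-to-position term):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, max_rel = 6, 64, 8              # toy sizes; DeBERTa clips relative distances to a window

content = rng.normal(size=(T, d))                 # "who" vectors, one per token
rel_table = rng.normal(size=(2 * max_rel, d))     # "where" vectors, one per relative distance

def bucket(i, j):
    # clip the relative distance i - j into the table's range
    return int(np.clip(i - j + max_rel, 0, 2 * max_rel - 1))

Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Qc, Kc = content @ Wq, content @ Wk               # content projections
Qr, Kr = rel_table @ Wq, rel_table @ Wk           # position projections (shared here for brevity)

scores = np.zeros((T, T))
for i in range(T):
    for j in range(T):
        b = bucket(i, j)
        scores[i, j] = (
            Qc[i] @ Kc[j]      # content-to-content:  "river" vs "bank"
            + Qc[i] @ Kr[b]    # content-to-position: "river" vs "five slots away"
            + Kc[j] @ Qr[b]    # position-to-content: "five slots away" vs "bank"
        )
scores /= np.sqrt(3 * d)       # DeBERTa scales by sqrt(3d) because it sums three terms
```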
Era 3: RoPE (ModernBERT) — Spinning the Dial
Years later, a much more elegant mathematical solution took over the world, culminating in its use in ModernBERT: Rotary Positional Embeddings (RoPE).
Instead of adding a static number to the word, RoPE treats the embedding space as a series of 2D planes and rotates the Query and Key vectors like the hands of a clock. The angle of rotation corresponds directly to the word’s position in the sentence. Token 1 gets rotated a little bit. Token 50 gets rotated a lot.
Why is this brilliant? Because of the geometric rules of rotation. If you want to know the distance between two words, you don’t need to know their absolute positions. You just take the dot product of their rotated vectors. The math naturally cancels out the absolute position and leaves behind a pure encoding of their relative distance.
By switching to RoPE, ModernBERT gained the ability to extrapolate gracefully and track distances across massive 8,192-token documents.
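You can see the “absolute position cancels out” property directly in a small NumPy sketch (using the common half-split pairing of dimensions; real implementations differ in layout details):

```python
import numpy as np

def rope(x, position, base=10_000.0):
    """Rotate pairs of dimensions of x by angles that grow with position."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # one rotation speed per 2D plane
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

# Same gap of 4 positions, wildly different absolute positions:
a = rope(q, 3) @ rope(k, 7)
b = rope(q, 103) @ rope(k, 107)
print(np.isclose(a, b))   # True: the dot product only remembers the relative distance
```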

Part 3: The Wall and the Speed of Light
As we push our models to read 8,192 tokens (like an entire legal contract or a dense codebase), we hit a brutal computational reality: O(N²) scaling.
Because every word must shake hands with every other word, an 8,192-token document requires an attention grid of 8,192 x 8,192. That is 67 million dot products. Per layer. Per attention head. We call this the Quadratic Wall. How did researchers try to climb it?
Attempt 1: Sparse Attention (BigBird, Longformer)
If 67 million handshakes are too many, what if we just… stop shaking everyone’s hand? Models like Longformer and BigBird introduced “Sparse Attention.” They forced words to only look at their immediate neighbors (a sliding window of, say, 64 tokens), while electing a few special tokens (like [CLS]) to act as "global" tokens that could see everything.

- The Result: It dropped the computational cost from O(T²) to O(T×W), where W is the window size.
- The Catch: It sacrificed the deep, holistic understanding that makes encoders powerful in the first place. You can’t truly understand a massive legal contract if you only ever read 64 words at a time.
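A quick sketch of what such a sparsity pattern looks like (a toy mask builder in the spirit of Longformer and BigBird, not their actual implementations):

```python
import numpy as np

def sparse_mask(T, window=64, global_tokens=(0,)):
    """Sliding-window mask plus a few global tokens, Longformer/BigBird style."""
    mask = np.zeros((T, T), dtype=bool)
    for i in range(T):
        lo, hi = max(0, i - window), min(T, i + window + 1)
        mask[i, lo:hi] = True          # each token sees only its local neighborhood
    for g in global_tokens:            # global tokens (e.g. [CLS]) see, and are seen by, everyone
        mask[g, :] = True
        mask[:, g] = True
    return mask

m = sparse_mask(T=512, window=64)
print(int(m.sum()), "allowed handshakes instead of", 512 * 512)
```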
Attempt 2: Linear Attention (Performer)
What if we change the math? The Performer used the “kernel trick.” By applying a mathematical function to the Queries and Keys, it allowed the matrix multiplication order to be swapped. Instead of multiplying Q×Kᵀ first (which creates the massive T×T grid), it multiplied Kᵀ×V first, completely bypassing the massive grid.

- The Result: It dropped the cost to O(T×d²) — linear scaling!
- The Catch: The kernel trick is an approximation. It turns out, exact Softmax creates incredibly “sharp” attention peaks (e.g., exactly focusing 99% of attention on one crucial word). Linear approximations blur these peaks, devastating the model’s accuracy on precise reasoning tasks.
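The reordering trick is easy to demonstrate with a crude positive feature map standing in for the Performer's random features; the point is that the two routes produce identical numbers while only one of them ever builds the T×T grid.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 1024, 64
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))

# Crude positive feature map as a stand-in; Performer uses random features that
# approximate the softmax kernel, and "Transformers are RNNs" uses elu(x) + 1.
phi = lambda x: np.maximum(x, 0.0) + 1e-6
Qf, Kf = phi(Q), phi(K)

# Quadratic route: build the T x T grid first.
grid = Qf @ Kf.T
out_quadratic = (grid @ V) / grid.sum(axis=1, keepdims=True)

# Linear route: associativity lets us compute K^T V first, a tiny d x d matrix.
kv = Kf.T @ V                  # d x d, independent of sequence length
z = Kf.sum(axis=0)             # normalizer, length d
out_linear = (Qf @ kv) / (Qf @ z)[:, None]

print(np.allclose(out_quadratic, out_linear))   # True: same result, no T x T grid
```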
The Breakthrough: FlashAttention (The Illusion of FLOPs)
In 2022, a paper called FlashAttention pointed out a blinding truth about modern GPU hardware: The math isn’t what takes time. Moving the data is what takes time.
A GPU has a massive, slow memory pool (HBM) and a tiny, lightning-fast brain (SRAM). To calculate the attention grid, older implementations computed the 67-million-element score matrix, wrote the whole thing out to the slow memory (because it cannot fit in the fast one), read it back in to run Softmax, then read it back again to multiply by the Values. This traffic jam of data transfer was starving the GPU.
FlashAttention rewrote the rules. By “tiling” the queries and keys into small blocks, it calculates the attention, normalizes the softmax, and outputs the final contextualized value while keeping everything inside the tiny, fast SRAM. It never writes the massive matrix to the slow memory.
It is the exact same math, achieving the exact same result, but it bypasses the physical speed limits of the hardware, achieving up to 10x speedups and freeing up gigabytes of memory.
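The real FlashAttention is a fused CUDA kernel, but the core idea (process the keys block by block while keeping running softmax statistics, so the full grid never exists in memory) can be sketched in plain NumPy:

```python
import numpy as np

def attention_tiled(Q, K, V, block=128):
    """Exactly the same result as full softmax attention, computed one key-block at a time.
    A toy, single-threaded rendition of the idea; real kernels fuse this on-chip in SRAM."""
    T, d = Q.shape
    out = np.zeros_like(Q)
    running_max = np.full(T, -np.inf)     # running max of scores, per query
    denom = np.zeros(T)                   # running softmax denominator, per query
    for start in range(0, T, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Q @ Kb.T / np.sqrt(d)                     # scores for this block only
        new_max = np.maximum(running_max, s.max(axis=1))
        rescale = np.exp(running_max - new_max)       # fix up what was accumulated so far
        p = np.exp(s - new_max[:, None])
        denom = denom * rescale + p.sum(axis=1)
        out = out * rescale[:, None] + p @ Vb
        running_max = new_max
    return out / denom[:, None]

rng = np.random.default_rng(0)
T, d = 512, 64
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))

scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
reference = (weights / weights.sum(axis=1, keepdims=True)) @ V

print(np.allclose(attention_tiled(Q, K, V), reference))   # True: exact, not an approximation
```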

The Alternating Heartbeat (ModernBERT)
Even with FlashAttention, full N x N attention across 8,192 tokens is heavy. ModernBERT introduced a masterclass architectural compromise: The Alternating Schedule.
Imagine a detective. They spend most of their time looking closely at the clues immediately around them with a magnifying glass (Local Attention — a sliding window of 128 words). But every so often, they stand up, look at the entire crime board, and connect the distant threads (Global Attention — full attention).
ModernBERT interleaves these layers: Local, Local, Global, Local, Local, Global. This heartbeat allows the model to process local syntax blazing fast, while periodically allowing information to teleport across the entire 8,000-word document, at a small fraction of the cost of running full attention in every layer.
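A back-of-the-envelope sketch of why the schedule pays off (the layer count, window size, and global interval roughly follow the ModernBERT base configuration, but treat the exact numbers and ordering as illustrative):

```python
def layer_schedule(num_layers=22, global_every=3):
    """One full-attention layer every few layers, sliding-window attention otherwise."""
    return ["global" if i % global_every == 0 else "local" for i in range(num_layers)]

def handshakes(kind, T, window=128):
    # rough per-layer handshake counts, ignoring heads and constants
    return T * T if kind == "global" else T * (2 * window + 1)

T = 8192
schedule = layer_schedule()
mixed = sum(handshakes(kind, T) for kind in schedule)
all_global = len(schedule) * T * T
print(schedule[:6])
print(f"mixed schedule costs {mixed / all_global:.0%} of all-global attention")
```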

Part 4: Inside the Black Box
We know the math. We know the hardware. But what is the model actually thinking? When we slice open these deep encoder networks, we find fascinating, almost biological behaviors.
The Pressure Valve: Attention Sinks
If you look at the attention grid of a trained transformer, you’ll see something bizarre: massive, dark vertical lines resting on completely meaningless tokens, like commas, periods, or the [CLS] token.
Researchers (Xiao et al., 2024) dubbed these “Attention Sinks”. Remember that the Softmax function forces every word to distribute exactly 100% of its attention. But what if a word like “the” doesn’t need context from the rest of the sentence? It is forced by the math to look at something.
The model brilliantly learns to use these special tokens as garbage dumps. When a word has no strong semantic relationships to pursue, it routes its attention mass into the sink. This acts as a pressure-release valve, protecting the delicate, highly specific relationships between actual content words from being diluted by noise.
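If you want to spot sinks yourself, a rough probe with the Hugging Face transformers library looks like the sketch below; it downloads bert-base-uncased and simply sums how much attention each key token receives, which is a blunt but telling measurement.

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_attentions=True)

inputs = tok("The river bank was muddy after the storm.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.attentions holds one (batch, heads, seq, seq) tensor per layer.
# Average over layers and heads, then sum each *column*: total attention a key token receives.
attn = torch.stack(out.attentions).mean(dim=(0, 2))[0]   # (seq, seq)
received = attn.sum(dim=0)
for token, mass in zip(tok.convert_ids_to_tokens(inputs["input_ids"][0]), received):
    print(f"{token:>10s}  {mass.item():.2f}")   # [CLS], [SEP] and punctuation tend to dominate
```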

The Collapse of Space: Topological Geometry
In my own recent research (“Geometric Concept Spaces in Small Encoders,” Leo et al., 2026), we stopped looking at the attention grids and started looking at the shape of the thoughts themselves.
Using techniques to estimate intrinsic dimensionality, we mapped out the shape of the embeddings as they passed through the deep layers of DeBERTa and ModernBERT.
We discovered a phenomenon we call Topological Collapse.
In older models like DeBERTa, concepts remain highly dimensional and spaced out — like a rich, 3D galaxy of ideas. But in the final layers of ModernBERT, we watched the manifold physically flatten. The model condenses highly complex, high-dimensional concepts into an incredibly dense, nearly 2-dimensional sheet.
This creates “semantic entanglement.” It makes ModernBERT unbelievably fast and accurate at benchmark tasks, but the ideas become so densely packed that it is difficult to cleanly extract fine-grained, underlying concepts without highly complex, non-linear probing tools. It is a stunning visual tradeoff between computational efficiency and representation clarity.
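The estimators used in the paper are more involved, but a crude proxy for this kind of measurement, the participation ratio of the PCA spectrum of each layer's hidden states, can be sketched with the transformers library. This assumes a recent release that includes ModernBERT; the model IDs shown are the public Hugging Face checkpoints.

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

name = "answerdotai/ModernBERT-base"    # swap in "microsoft/deberta-v3-base" to compare
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True)

text = "The river bank was muddy, but the investment bank was spotless."
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).hidden_states   # embeddings plus one tensor per layer

def participation_ratio(h):
    x = h[0].numpy()                            # (seq, d) hidden states for this sentence
    x = x - x.mean(axis=0, keepdims=True)
    eig = np.linalg.eigvalsh(np.cov(x.T))       # PCA eigenvalues
    return (eig.sum() ** 2) / (eig ** 2).sum()  # rough "effective number of dimensions"

for layer, h in enumerate(hidden):
    print(f"layer {layer:2d}: effective dim ~ {participation_ratio(h):.1f}")
```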

Epilogue: See It to Believe It
You can read the equations of scaled dot-products all day, but nothing compares to watching a neural network think in real-time.
Because seeing the geometry of attention is so vital to understanding it, I have open-sourced the Encoder Attention Explorer, a tool that allows you to feed your own text into BERT, RoBERTa, DeBERTa, and ModernBERT, and watch the attention route itself live.
GitHub Repository: https://github.com/cristianleoo/attention-in-encoders
If you pull down the repo, you can visualize the Token Flow — watching the actual bipartite graph of how a source token distributes its attention mass. You can find the Attention Sinks yourself. You can benchmark the exact moment your GPU hits the O(N²) wall.
We are entering an era where understanding the physics and geometry of these models is just as important as writing the code to train them. By mastering the journey from the naive dot-product to the hardware-fused elegance of FlashAttention, we move from being just users of AI, to becoming its architects.
References
[1] Leo, C. and Begimher, D. (2026). Survey of Attention Mechanisms in Encoder-Only Language Models. SSRN Preprint.
[2] Leo, C., et al. (2026). Geometric Concept Spaces in Small Encoders: A Comparative Mechanistic Probing of ModernBERT and DeBERTa-v3.
[3] Vaswani, A., et al. (2017). Attention Is All You Need.
[4] Devlin, J., et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers.
[5] He, P., et al. (2021). DeBERTa: Decoding-Enhanced BERT with Disentangled Attention.
[6] Warner, B., et al. (2024). Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder.
[7] Zaheer, M., et al. (2020). Big Bird: Transformers for Longer Sequences.
[8] Katharopoulos, A., et al. (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention.
[9] Dao, T., et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention.
[10] Xiao, G., et al. (2024). Efficient Streaming Language Models with Attention Sinks.