Sequence data is messy. Words depend on earlier words, notes depend on earlier notes, signals depend on earlier signals, and the past has this annoying habit of refusing to stay irrelevant. That is exactly why recurrent models exist in the first place.
But vanilla RNNs break in a very specific way. They are built for memory, yet they are bad at remembering. That contradiction is the entire reason LSTM exists.

1. Why RNN Was Not Enough
The goal of an RNN was straightforward: handle sequential data where previous information matters.
Common examples include:
- language modeling
- speech recognition
- time series forecasting
- translation
- chat-like sequence prediction
In theory, an RNN should remember all previous information.
Consider this sentence:
“The boy who lives next door is tall.”
To predict “is tall”, the model must remember “boy”, which appeared many words earlier.
That is the core job of a sequence model. RNN should do it.
But in practice, it fails.
2. Where RNN Fails: The Real Problem
The fundamental issue with a vanilla RNN is simple:
It cannot remember long-term dependencies.
Why this matters
Imagine reading:
“I grew up in France… so I speak fluent ______.”
You instantly know the answer is French.
Why? Because your brain keeps track of information from far back in the sentence.
RNN is supposed to do the same thing.
3. How RNN Should Work
At first glance, RNN seems perfect for sequence data:
- it reads words one by one
- it keeps a hidden state as memory
- it passes that memory forward through the sequence
So theoretically:
RNN should remember everything from the past.
That is the promise.
The problem is that the promise collapses under training.
4. What Actually Happens: The Failure
During training, RNNs use Backpropagation Through Time (BPTT).
That means:
- errors flow backward through each time step
- gradients are multiplied again and again across the sequence
And here is the real problem:
Gradients shrink exponentially over time.
Mathematically, the gradient through time behaves like a repeated product:
- ∂hₜ / ∂hₖ = ∏ⱼ ( ∂hⱼ / ∂hⱼ₋₁ ), for j = k+1 … t
In the simplified case, each factor is roughly the recurrent weight times an activation derivative, so:
- ∂hₜ / ∂hₖ ≈ ( w · f′ )^(t − k)
- if values are less than 1, gradients shrink → vanish
- if values are greater than 1, gradients grow → explode
Think of it like this:
- recent words have strong influence
- old words have almost zero influence
Why?
Because activation functions like sigmoid and tanh produce small derivatives, and multiplying those small values over and over drives the gradient toward zero.
Example:
The derivative of the sigmoid never exceeds 0.25, so across 20 time steps the gradient can shrink by a factor of roughly 0.25²⁰ ≈ 9 × 10⁻¹³.
So the signal from earlier words dies before it reaches the output.
That is why the model becomes biased toward the most recent inputs.
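To make this concrete, here is a tiny Python sketch of the repeated multiplication; the per-step factors 0.25 and 1.5 are arbitrary illustrative numbers, not values from a trained network:

```python
# Illustrative only: how a gradient scales when the same per-step
# factor is multiplied across many time steps (BPTT-style).
def gradient_magnitude(factor, steps):
    grad = 1.0
    for _ in range(steps):
        grad *= factor            # one multiplication per time step
    return grad

print(gradient_magnitude(0.25, 20))   # ~9.1e-13 -> vanishes
print(gradient_magnitude(1.50, 20))   # ~3.3e+03 -> explodes
```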
5. What This Means in Practice
A vanilla RNN behaves like someone with short-term memory only:
- it remembers the last few words
- it forgets important context from earlier
So in the sentence:
“I grew up in France… so I speak fluent ______.”
the model may remember “so I speak fluent” but forget “France”.
That can lead to the wrong prediction.
6. The Core Problem, Stated Clearly
RNN cannot learn long-term dependencies because gradients vanish over time.
That is the vanishing gradient problem.
Because of this:
- RNN focuses mostly on recent inputs
- long-range relationships are lost
- early tokens contribute almost nothing to learning
This is why RNNs work reasonably well for short sequences, but begin to break down on longer ones.
7. Why Exactly This Happens
There are two main reasons:
- activation functions like tanh and sigmoid compress values into small ranges
- repeated multiplication through many time steps reduces the gradient toward zero
So early time steps receive almost no learning signal.
That is why:
- recent tokens affect the output more
- distant tokens barely matter
8. The Real Question Researchers Asked
Researchers asked something much better than “how do we make an RNN slightly less terrible?”
They asked:
Can we build a neural network that remembers important information for a long time and forgets useless information?
That question matters because human memory works that way.
We remember:
- important facts
- key events
But forget:
- noise
- irrelevant details
That insight led to Long Short-Term Memory (LSTM).
9. The Turning Point
Researchers realized:
The real problem was not sequence modeling itself.
The real problem was memory retention.
So the question changed from:
“Can we process sequences?”
to:
“Can we build a network that decides what to remember and what to forget?”
That question directly led to LSTM.
10. Birth of LSTM (1997)
Key paper
Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780.
Core idea
Instead of letting gradients vanish, force constant error flow.
That is the entire point of the architecture.
How LSTM Fixed Memory: The Real Innovation
11. The Basic Idea of LSTM
LSTM separates memory into two ideas:
- Short Term Context (STC)
- Long Term Context (LTC)
Key concept
- important information from STC → moves to LTC
- if something becomes not important later → removed from LTC
That is the core design philosophy.
12. The Key Idea of LSTM
LSTM introduces a new concept:
Memory Cell (Cell State)
Think of it as:
a conveyor belt running through time.
Information can travel through this cell almost unchanged.
This solves the gradient issue because:
- gradients flow through additive updates
- not repeated multiplications
So they do not vanish easily.
13. Problem with RNN
In RNN:
- there is only one path to maintain memory/state
- this path tries to store both STC and LTC
Reality:
- it is not possible to store both effectively in the same path
- STC dominates LTC
- result → long-term dependencies are lost
That is the core problem.
14. Solution Idea (LSTM)
LSTM introduces another path across timesteps.
This path specifically carries LTC (cell state).
Important information:
- stored early
- preserved till later timesteps
- unless explicitly removed
That is the innovation.
15. Enter LSTM: A Memory System, Not Just a Network
Instead of a single hidden state like an RNN, LSTM introduces something more powerful:
A Cell State (Long-Term Memory)
Think of it like a highway running through time.
- information flows almost unchanged
- memory can persist across long sequences
- stable memory is easier to maintain
This is the central structural advantage.
16. The Highway Analogy
Imagine:
- RNN = narrow road → information gets lost
- LSTM = highway → information flows smoothly
But here is the twist:
Not everything should stay on the highway.
So LSTM adds control systems.
17. Two Types of Memory
LSTM has two kinds of memory:
- Long Term Memory → Cell State (Cₜ)
- Short Term Memory → Hidden State (hₜ)
This distinction matters.
- the cell state stores memory across time
- the hidden state exposes what is needed now
18. Interaction at Time (t)
Inputs:
- previous cell state → (Cₜ₋₁)
- previous hidden state → (hₜ₋₁)
- current input → (xₜ)
Outputs:
- current cell state → (Cₜ)
- current hidden state → (hₜ)
So, each timestep is not just “output and move on”; it is a controlled memory update.
19. Main Operations
At each timestep, LSTM does two main things:
- Update the cell state: Cₜ₋₁ → Cₜ
- Compute the hidden state: hₜ
That is the entire recurrent step, but with memory control built in.
20. Decision Logic
Based on current input (xₜ), LSTM decides:
- what to remove from long-term memory
- what to add to long-term memory
This is why it works better than a vanilla RNN, which blindly keeps pushing memory forward.
21. The Secret Weapon: Gates
LSTM introduced gates.
These are small decision makers that ask:
“Should I allow this information or not?”
They output values between 0 and 1:
- 0 → block completely
- 1 → allow fully
- values in between → partial flow
Three gates exist:
- Forget Gate
- Input Gate
- Output Gate
These gates act like decision systems.
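Numerically, a gate is just a sigmoid output multiplied elementwise with a vector. A minimal sketch, with made-up values chosen only to show the three behaviors:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

memory = np.array([2.0, -3.0, 0.5])             # made-up stored values
gate   = sigmoid(np.array([10.0, -10.0, 0.0]))  # ~[1.0, 0.0, 0.5]

print(gate * memory)   # ~[ 2.0, -0.0, 0.25]: allow, block, partial flow
```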
22. The Three Gates That Control Memory
Forget Gate
Question it answers:
What should we forget from the past?
Input Gate
Question it answers:
What new information should be stored?
Output Gate
Question it answers:
What part of memory should influence the output?
Together, these gates define the LSTM.
Detailed LSTM Architecture
23. Forget Gate
Purpose
- Remove unnecessary info from the cell state.
Computation
Input:
- hₜ₋₁
- xₜ
Concatenate:
- [hₜ₋₁, xₜ]
Then:
- fₜ = σ( [hₜ₋₁, xₜ]W_f + b_f )
Dimensions example
If:
- (xₜ) is 4-dimensional
- hidden state has 3 units
Then concatenated input has dimension:
- 4 + 3 = 7
So:
- Weight matrix → (7 × 3)
- Bias → (1 × 3)
- Output (fₜ) → (1 × 3)
Cell update for the forget part
- Cₜ₋₁ ⊙ fₜ
Because fₜ ∈ (0,1), each dimension of Cₜ₋₁ is scaled individually:
- if fₜ ≈ 1 → keep
- if fₜ ≈ 0 → forget
- otherwise → partially reduce
Interpretation
The forget gate controls:
“How much of each memory dimension should pass?”
The decision is based on:
- Current input: xₜ
- Previous hidden state: hₜ₋₁
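Here is a minimal NumPy sketch of the forget gate using the shapes from the example above; the weights are random and names like f_t and C_prev are purely illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

x_t    = rng.normal(size=(1, 4))   # current input x_t, 4-dimensional
h_prev = rng.normal(size=(1, 3))   # previous hidden state h_{t-1}, 3 units
C_prev = rng.normal(size=(1, 3))   # previous cell state C_{t-1}

W_f = rng.normal(size=(7, 3))      # (4 + 3) x 3 weight matrix
b_f = np.zeros((1, 3))

concat = np.concatenate([h_prev, x_t], axis=1)   # [h_{t-1}, x_t] -> (1 x 7)
f_t = sigmoid(concat @ W_f + b_f)                # forget gate, values in (0, 1)

C_filtered = C_prev * f_t          # C_{t-1} * f_t: keep, shrink, or forget each dimension
print(f_t.shape, C_filtered.shape) # (1, 3) (1, 3)
```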
24. Input Gate
Purpose
- Add new important information to the cell state.
Computation
Gate decision:
- iₜ = σ( [hₜ₋₁, xₜ] W_i + b_i )
Candidate memory:
- C̃ₜ = tanh( [hₜ₋₁, xₜ] W_c + b_c )
Dimensions example
- [hₜ₋₁, xₜ] → (1 × 7)
- W_i → (7 × 3)
- b_i → (1 × 3)
- iₜ → (1 × 3)
- C̃ₜ → (1 × 3)
Interpretation
- Candidate = “what can be added”
- Input gate = “what should actually be added”
Meaning
The input gate controls how much of (C̃ₜ) is added.
So:
- candidate cell state = potential information
- input gate = filter deciding what enters the memory
25. Cell State Update
Now the key update step:
- Cₜ = (Cₜ₋₁ ⊙ fₜ) + (iₜ ⊙ C̃ₜ)
Breakdown
1. First term: Cₜ₋₁ ⊙ fₜ
- Old memory is filtered. This means some useless data is forgotten.
2. Second term: iₜ ⊙ C̃ₜ
- New useful information is added. This is the current input, filtered by relevance.
Dimensions
- Cₜ → (1 × 3)
Why this matters
This additive update is the key reason gradients survive longer.
Instead of repeatedly multiplying and shrinking signal strength, LSTM adds information to a stable memory path.
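A small NumPy sketch of this update, assuming the same toy shapes as before (random weights, illustrative variable names):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

x_t    = rng.normal(size=(1, 4))                 # current input
h_prev = rng.normal(size=(1, 3))                 # previous hidden state
C_prev = rng.normal(size=(1, 3))                 # previous cell state
concat = np.concatenate([h_prev, x_t], axis=1)   # (1 x 7)

W_f, W_i, W_c = (rng.normal(size=(7, 3)) for _ in range(3))
b_f = b_i = b_c = np.zeros((1, 3))

f_t    = sigmoid(concat @ W_f + b_f)   # forget gate: what to erase
i_t    = sigmoid(concat @ W_i + b_i)   # input gate: what to let in
C_cand = np.tanh(concat @ W_c + b_c)   # candidate memory

# Additive update: filtered old memory plus filtered new information
C_t = C_prev * f_t + i_t * C_cand
print(C_t.shape)                       # (1, 3)
```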
26. Why LSTM Works Mathematically (Full Derivation)
From the cell update:
- Cₜ = fₜ ⊙ Cₜ₋₁ + iₜ ⊙ C̃ₜ
In the simplified derivation, the direct path from Cₜ₋₁ to Cₜ is controlled by fₜ, so:
- ∂Cₜ / ∂Cₜ₋₁ ≈ fₜ
This means:
- if fₜ ≈ 1, memory and gradients are preserved
- if fₜ ≈ 0, memory is forgotten
Extending Across Time
- ∂Cₜ / ∂Cₖ ≈ ∏ⱼ fⱼ, for j = k+1 … t
So the cell state provides a stable, controlled path for long-term information.
By contrast, in RNN:
- ∂hₜ / ∂hₖ = ∏ⱼ ( ∂hⱼ / ∂hⱼ₋₁ ) = ∏ⱼ Wᵀ · diag( tanh′(·) ), for j = k+1 … t
and each factor includes repeated matrix multiplication and activation derivatives, which often shrink toward zero.
Gradient Flow through hidden state:
- hₜ = oₜ ⊙ tanh(Cₜ)
So:
- ∂hₜ / ∂Cₜ = oₜ ⊙ (1 − tanh²(Cₜ))
This is why the output is bounded and why gradients remain more stable.
So LSTM separates gradient flow into two paths:
- a nonlinear exposed path through hₜ
- a stable memory path through Cₜ
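A quick numeric illustration of the two paths; the factor values 0.98 and 0.5 are invented purely to show the trend:

```python
import numpy as np

steps = 100

# LSTM-style path: each per-step factor is the forget gate value,
# which the network can learn to keep close to 1 (0.98 is illustrative).
forget_gates = np.full(steps, 0.98)
print(np.prod(forget_gates))   # ~0.13: the signal survives 100 steps

# RNN-style path: each factor bundles a weight with an activation
# derivative and often sits well below 1 (0.5 is illustrative).
rnn_factors = np.full(steps, 0.5)
print(np.prod(rnn_factors))    # ~7.9e-31: the signal is effectively gone
```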
27. Key Insight: Vanishing Gradient Problem
— In RNN:
- vanishing gradients cause information loss at each step
- long timesteps become disastrous
— In LSTM:
- possible to have very small information loss
- because the cell state supports stable propagation
Important notes:
When the forget gate stays close to 1, the information loss over long timesteps stays small.
Special cases
— Case 1
If:
- fₜ = 1
Then:
- Cₜ = Cₜ₋₁ + iₜ ⊙ C̃ₜ
Meaning: no information lost
— Case 2
If:
- iₜ = 0
Then:
- Cₜ = Cₜ₋₁ ⊙ fₜ
Meaning: no new information added
This is exactly the kind of controllable memory behavior RNN lacks.
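Both cases are easy to verify numerically; the values below are made up:

```python
import numpy as np

C_prev = np.array([[1.0, -2.0, 0.5]])   # made-up old memory
C_cand = np.array([[0.3,  0.3, 0.3]])   # made-up candidate memory

# Case 1: f_t = 1 -> old memory passes through untouched, new info is added
f_t, i_t = np.ones((1, 3)), np.full((1, 3), 0.5)
print(C_prev * f_t + i_t * C_cand)      # [[ 1.15 -1.85  0.65]]

# Case 2: i_t = 0 -> nothing new enters, only the (scaled) old memory remains
f_t, i_t = np.full((1, 3), 0.8), np.zeros((1, 3))
print(C_prev * f_t + i_t * C_cand)      # [[ 0.8 -1.6  0.4]]
```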
28. Long Sequence Insight
LSTM can preserve long-term context from start to end.
That makes it useful for:
- long sentences
- delayed dependencies
- sequences where early tokens matter much later
This is one of the most important conceptual advantages of the architecture.
29. Output Gate
Computation
- oₜ = σ( [hₜ₋₁, xₜ] W_o + b_o )
Dimensions example
- [hₜ₋₁, xₜ] → (1 × 7)
- W_o → (7 × 3)
- b_o → (1 × 3)
- oₜ → (1 × 3)
Hidden state
- hₜ = oₜ ⊙ tanh(Cₜ)
What this does
- Apply tanh to Cₜ → values in (-1, 1)
- Apply filtering using oₜ
Interpretation
The output gate decides:
“How much of the memory should be exposed as the current hidden state (hₜ)?”
This becomes the output passed to the next timestep.
Why Apply (tanh) Before the Output Gate Multiplies?
Why not directly use: hₜ = oₜ ⊙ Cₜ
30. Hidden State in LSTM
The hidden state in LSTM is:
- hₜ = oₜ ⊙ tanh(Cₜ)
Short answer
We apply tanh to control scale and stabilize the output. Otherwise, the hidden state can explode and training becomes unstable.
Deep intuition
What is (Cₜ)?
- it is the cell state
- it keeps accumulating information over time
Important point:
(Cₜ) is unbounded and can grow very large.
That is fine for memory storage, but terrible for direct exposure as output.
Problem if you use (Cₜ) directly
If you do: hₜ = oₜ ⊙ Cₜ
Then:
- if (Cₜ = 50), the output becomes huge
- if (Cₜ= -100), the output becomes extreme
- this propagates forward and makes the network unstable
That leads to:
- exploding activations
- unstable gradients
- poor training
Why tanh(Cₜ)?
Because tanh does this:
tanh(x) ∈ (-1, 1)
So no matter how large (Cₜ) gets:
the output is always controlled
Intuition
Think of it like this:
- Cₜ = raw memory (can be messy, large, and unfiltered)
- tanh = compress and normalize memory
- oₜ = decide what to show
So the actual flow is:
- memory stored → ( Cₜ )
- normalize → ( tanh(Cₜ) )
- filter → multiply with ( oₜ )
Why not let the output gate handle everything?
The sigmoid output gate ( oₜ ) only scales from 0 to 1.
It cannot normalize large values.
Example:
- (Cₜ = 100)
- (oₜ = 0.9)
Then: hₜ = 0.9 · 100 = 90
That is still huge.
So:
- gate controls how much
- tanh controls how big
They do different jobs.
(tanh) ensures the hidden state remains bounded, while the output gate decides how much of that bounded information to expose.
Analogy
Think of ( Cₜ ) as raw water pressure in a pipe.
( tanh ) acts like a regulator that keeps pressure safe.
The output gate is the tap deciding how much water to release.
Final insight
This also helps with:
- smoother gradients
- better generalization
- stable training over long sequences
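A tiny numeric check of this point; the cell state and gate values are invented for illustration:

```python
import numpy as np

C_t = np.array([100.0, -40.0, 0.2])   # unbounded cell state (made-up values)
o_t = np.array([0.9,    0.5,  1.0])   # output gate values in (0, 1)

print(o_t * C_t)             # [ 90.  -20.    0.2 ]  -> still huge
print(o_t * np.tanh(C_t))    # [  0.9  -0.5   0.197] -> safely bounded
```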
Putting It All Together
31. Summary of Flow
At each timestep:
— Forget gate
- removes irrelevant past info
— Input gate
- adds new important info
— Cell state update
- combines old filtered memory with new filtered information
— Output gate
- produces hidden state
32. Full LSTM Flow
Step by step:
- Candidate state: C̃ₜ = tanh( [hₜ₋₁, xₜ] W_c + b_c )
- Input gate: iₜ = σ( [hₜ₋₁, xₜ] W_i + b_i )
- Forget gate: fₜ = σ( [hₜ₋₁, xₜ] W_f + b_f )
- Cell update: Cₜ = ( Cₜ₋₁ ⊙ fₜ ) + ( iₜ ⊙ C̃ₜ )
- Output gate: oₜ = σ( [hₜ₋₁, xₜ] W_o + b_o )
- Hidden state: hₜ = oₜ ⊙ tanh(Cₜ)
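Putting those equations into one function, here is a minimal NumPy sketch of a single LSTM timestep. The function name lstm_step, the random weights, and the toy shapes are illustrative, not a production implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM timestep, following the equations listed above."""
    concat = np.concatenate([h_prev, x_t], axis=1)             # [h_{t-1}, x_t]

    C_cand = np.tanh(concat @ params["W_c"] + params["b_c"])   # candidate state
    i_t    = sigmoid(concat @ params["W_i"] + params["b_i"])   # input gate
    f_t    = sigmoid(concat @ params["W_f"] + params["b_f"])   # forget gate
    o_t    = sigmoid(concat @ params["W_o"] + params["b_o"])   # output gate

    C_t = C_prev * f_t + i_t * C_cand      # additive cell update
    h_t = o_t * np.tanh(C_t)               # hidden state
    return h_t, C_t

# Example with the shapes used in this article: x_t is 1x4, hidden size is 3.
rng = np.random.default_rng(0)
params = {name: rng.normal(size=(7, 3)) * 0.1 for name in ("W_c", "W_i", "W_f", "W_o")}
params.update({name: np.zeros((1, 3)) for name in ("b_c", "b_i", "b_f", "b_o")})

h, C = np.zeros((1, 3)), np.zeros((1, 3))
for _ in range(5):                         # run a short made-up sequence
    h, C = lstm_step(rng.normal(size=(1, 4)), h, C, params)
print(h.shape, C.shape)                    # (1, 3) (1, 3)
```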
33. Dimensional Summary
Component | Shape
-------------------------|----------
xₜ | (1 × 4)
hₜ₋₁ | (1 × 3)
Concat [hₜ₋₁, xₜ] | (1 × 7)
Weights (W) | (7 × 3)
Outputs (gates/cand.) | (1 × 3)
34. LSTM Diagram Explained Step-by-Step

C(t-1) ------------------------------------> C(t)
                 Cell State (Cₜ)
👉 This is the Cell State (memory highway): it runs straight across time.
35. Step 1: Forget Gate
It takes two inputs:
- previous hidden state (hₜ₋₁)
- current input (xₜ)
h(t-1) ----\
            > [Forget Gate] ---> (multiplies with C(t-1))
x(t) ------/
What it means
The forget gate decides what past information should be removed from memory.
It produces a value between 0 and 1, and multiplies it with the previous memory:
- Cₜ₋₁ ⊙ fₜ
So:
- 0 means forget completely
- 1 means keep completely
36. Step 2: Input Gate
This part has two pieces:
- Input Gate using sigmoid
- Candidate Memory using tanh
h(t-1), x(t) → [Input Gate]
h(t-1), x(t) → [Candidate Values]
Then they are combined.
What it means
The input gate decides what new information should be added.
So:
- Candidate (C̃ₜ) = possible new memory
- Gate (iₜ) = how much of it to allow
The result is:
iₜ ⊙ C̃ₜ
37. Step 3: Update Cell State
Now combine the old memory and the new memory.
This is the most important part of LSTM:
- Cₜ = (fₜ ⊙ Cₜ₋₁) + (iₜ ⊙ C̃ₜ)
Memory is updated by:
- removing old irrelevant information
- adding new useful information
This is why LSTM works:
- no repeated shrinking multiplication
- smooth information flow
- long-term memory stays alive
38. Step 4: Output Gate
h(t-1), x(t) → [Output Gate]
C_t → tanh → × Output Gate → h(t)
What it means
The output gate decides what part of memory to use right now.
The final output is:
- hₜ = oₜ ⊙ tanh(Cₜ)
So, the cell state stores memory, and the output gate decides what gets exposed.
39. Final Full Diagram Logic
When you combine everything:
- Left → forget old memory
- Middle → add new memory, update cell state
- Right → produce output
And the cell state flows straight across like a highway.
That horizontal path is the reason LSTM remembers long-term. The gates just control the traffic.
40. Super Simple Analogy
Think of LSTM like a smart notebook:
- ✂️ Forget Gate → erase useless notes
- ✍️ Input Gate → write important things
- 📖 Output Gate → read only what is needed
41. Intuition in One Line
RNN = “Try to remember everything” ❌
LSTM = “Remember smartly” ✅
42. Why LSTM Solves RNN Problems
Main innovations:
a. Memory Cell
Stores long-term information.
b. Gating Mechanism
Controls information flow.
c. Additive State Updates
Avoids exponential gradient decay.
d. Constant Error Flow
Gradients can propagate through many timesteps.
Because of this:
- LSTM can learn dependencies across hundreds or thousands of steps
43. Training Process
Steps
- Forward pass through sequence
- Compute loss
- Backpropagation Through Time (BPTT)
- Update weights using gradient descent
Complexity
- Time: O(T · n²), where T is the sequence length and n is the hidden size
- Memory: high, because all states must be stored
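As a rough sketch of what this loop looks like in code, here is a minimal PyTorch example; the model, data, and hyperparameters are invented, nn.LSTM runs the forward pass over the sequence, and loss.backward() performs BPTT:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.LSTM(input_size=4, hidden_size=3, batch_first=True)
head  = nn.Linear(3, 1)                     # simple regression head
optim = torch.optim.Adam(list(model.parameters()) + list(head.parameters()), lr=1e-2)
loss_fn = nn.MSELoss()

x = torch.randn(8, 20, 4)                   # batch of 8 sequences, T = 20
y = torch.randn(8, 1)                       # one made-up target per sequence

for epoch in range(100):
    out, (h_n, c_n) = model(x)              # forward pass through the sequence
    pred = head(out[:, -1, :])              # use the last hidden state
    loss = loss_fn(pred, y)                 # compute loss
    optim.zero_grad()
    loss.backward()                         # Backpropagation Through Time
    optim.step()                            # gradient descent update
```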
44. Improvements Over Time
LSTM did not remain the only gated recurrent design. Researchers pushed it further.
GRU (Cho et al., 2014)
- simpler version of LSTM
- fewer gates
- often faster and easier to train
Peephole LSTM
- gates can see the cell state
- adds extra access to memory information
Bidirectional LSTM
- uses past and future context
- useful when the full sequence is available
Attention Mechanism
- later reduced reliance on LSTM
- improved long-range dependency handling
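For reference, two of these variants are available directly in PyTorch; this is only a hedged sketch, and peephole connections are not part of the built-in nn.LSTM, so they would need a custom cell:

```python
import torch.nn as nn

# GRU: fewer gates than LSTM, often faster to train
gru = nn.GRU(input_size=4, hidden_size=3, batch_first=True)

# Bidirectional LSTM: reads the sequence forward and backward,
# so each output step carries 2 * hidden_size features
bilstm = nn.LSTM(input_size=4, hidden_size=3, batch_first=True, bidirectional=True)
```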
45. Real Applications of LSTM
Before Transformers, LSTM dominated sequence learning.
Examples:
- Google Translate
- speech recognition
- handwriting recognition
- time series prediction
- NLP tasks like translation and chatbots
- stock prediction
- healthcare time series
- video processing
46. Limitations of LSTM
LSTM is powerful, but it is not magic.
It still has limits:
- slow training
- hard to parallelize
- memory intensive
- sequential computation
- difficulty with very long sequences
- replaced by Transformers in many tasks
47. Evolution After LSTM
Model progression:
RNN → LSTM → GRU → Attention → Transformers
Even LSTM has limitations, and those limitations eventually led to:
Attention Is All You Need (the Transformer).
48. Key Academic Sources
Here are referenced foundational and influential papers:
- Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780.
- Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166.
- Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Technische Universität München.
- Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). Learning to forget: Continual prediction with LSTM. Neural Computation.
- Graves, A. (2012). Supervised Sequence Labelling with RNNs.
- Graves, A., Mohamed, A., & Hinton, G. (2013). Speech recognition with deep RNNs.
- Cho, K. et al. (2014). Learning Phrase Representations using RNN Encoder–Decoder.
- Greff, K. et al. (2017). LSTM: A Search Space Odyssey.
- Olah, C. (2015). Understanding LSTM Networks.
- Sak, H., Senior, A., & Beaufays, F. (2014). Long short-term memory recurrent neural network architectures for large scale acoustic modeling.
49. Finally
LSTM did not just improve RNN.
It changed the idea of memory in neural networks.
Instead of forcing the model to remember everything,
it taught the model how to remember.
RNN failed because gradients vanished over time, making it forget long-term dependencies.
LSTM was born to solve that by using a cell state and gates to control what to remember, what to forget, and what to use.
50. About the Author
Alok Ranjan Singh is a machine learning enthusiast and writer who enjoys breaking down deep learning ideas into clear, intuitive explanations. This article reflects an ongoing effort to study sequence models, gradient flow, and the mechanics behind memory in neural networks.
Connect with me on LinkedIn for discussions:
🔗 LinkedIn Profile Link