Building an LLM From Scratch. Here’s Where It All Starts.


While exploring and building an LLM from scratch, without any external modules, I started at the lowest but most important level: character-based tokenization.

Before 2010, most NLP systems were rule-based or statistical (N-grams).
This approach powered early NLP systems for spell checking, language detection, and very basic text generation, long before Byte Pair Encoding became dominant.
But after 2018, once “Attention Is All You Need” (2017) had set off the transformer revolution, most models shifted to sub-word tokenizers (BPE, WordPiece), and character-level tokenizers were abandoned by large language models.


Why Character-Level Tokenizers Were Actually Smart

  1. No out-of-vocabulary problem (OOV solved).
    Since a character-level tokenizer works on individual characters, a model trained on reasonably diverse text never meets an unknown token: every character it will ever see is already in the vocabulary and the probability matrix (see the sketch after this list).
  2. Language independent.
    With character-level encoding, no code changes are needed even if we change the language of the training corpus; the same process simply runs on the new language’s text.
    That said, handling can still get complicated: Hindi, for example, has complex character combinations that need a different kind of handling.
  3. Handles typos naturally.
    “helo” is still understandable, whereas at the word level “helo” would be OOV and would have to be patched over with Laplace smoothing.
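
To make points 1 and 3 concrete, here is a tiny illustrative sketch (my addition; both vocabularies are made up for the example):

word_vocab = {"hello", "world"}   # a word-level vocabulary
char_vocab = set("helowrd ")      # a character-level vocabulary

query = "helo"   # a typo, never seen as a whole word

print(query in word_vocab)                  # False: OOV at the word level
print(all(c in char_vocab for c in query))  # True: fully covered at the char level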

Let’s Build It. From Scratch.

Let’s take a short and simple training corpus:

text = "in the midst of chaos, there is also opportunity. the journey of learning is never easy, but it is always rewarding. every mistake you make is a step closer to understanding. intelligence is not just about knowing facts, but about connecting ideas and seeing patterns where others see noise. persistence and curiosity are the two most powerful tools you can carry. over time, even the most complex problems begin to feel manageable, and what once seemed impossible becomes second nature."

In this text you can see mostly letters, a space character, and some punctuation. But a machine doesn’t understand any of it, so we convert the characters into numbers by giving each one a unique index:

char = sorted(set(text))   # unique characters in the corpus, sorted for a stable order

char_to_idx = {}
for i, ch in enumerate(char):
    char_to_idx[ch] = i

idx_to_char = {}
for ch, i in char_to_idx.items():
    idx_to_char[i] = ch

In this code block we did two things:
1. Gave every unique character its own number.
2. Built dictionaries for index-to-char and char-to-index lookups.
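
A quick sanity check (my addition; it assumes the block above has already run): encode a word from the corpus and decode it back.

encoded = [char_to_idx[ch] for ch in "chaos"]
decoded = ''.join(idx_to_char[i] for i in encoded)

print(encoded)   # one small integer per character
print(decoded)   # "chaos": the round trip is lossless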

import numpy as np

tokenizer = {
    "char_to_idx": char_to_idx,
    "idx_to_char": idx_to_char
}

bigram_counts = np.zeros((len(char), len(char)))
tokens = [char_to_idx[ch] for ch in text]

# Count how often each character is followed by each other character.
for i in range(len(tokens) - 1):
    bigram_counts[tokens[i], tokens[i + 1]] += 1

In this section of the code we build the bigram matrix: we count how often every pair of adjacent characters occurs in the training set and store the counts in a square matrix whose size equals the number of unique characters.
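
To see what the matrix captured, here is a small peek (my addition) at the five most frequent character pairs:

pairs = [(bigram_counts[i, j], char[i] + char[j])
         for i in range(len(char))
         for j in range(len(char))
         if bigram_counts[i, j] > 0]

for count, pair in sorted(pairs, reverse=True)[:5]:
    print(repr(pair), int(count))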

def softmax(x):
    exp_x = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return exp_x / exp_x.sum()

bigram_probs = np.zeros_like(bigram_counts)
for i in range(len(char)):
    row = bigram_counts[i]
    if row.sum() > 0:
        bigram_probs[i] = softmax(row)

This is the main mathematical part: we normalize each row of the matrix so the values fall within [0, 1] and each row sums to 1, turning raw counts into a distribution we can sample from.

One thing I want to be honest about in my implementation: I used softmax to normalize the bigram counts, which is not exactly the right move.

Softmax was designed for logits, not frequency counts. Exponentiating counts distorts the ratios between them, and, worse, every zero count still gets a weight of exp(0) = 1, so character pairs that never occurred in the training text end up receiving real probability mass.

The correct approach is just plain normalization:

bigram_probs[i] = bigram_counts[i] / bigram_counts[i].sum()

Small difference in code, meaningful difference in understanding.
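
Here is a toy comparison (my addition) that makes the difference visible. The counts are made up: one pair seen three times, one seen once, four never seen.

import numpy as np

counts = np.array([3.0, 1.0, 0.0, 0.0, 0.0, 0.0])

norm = counts / counts.sum()            # plain normalization
exp_c = np.exp(counts - counts.max())
soft = exp_c / exp_c.sum()              # softmax

print(norm.round(3))   # [0.75 0.25 0. 0. 0. 0.] -> unseen pairs stay at 0
print(soft.round(3))   # [0.749 0.101 0.037 ...] -> unseen pairs leak probability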

That’s the matrix after applying the softmax function.

def generation(start_char, max_length):
    start_idx = char_to_idx[start_char]
    current_idx = start_idx
    result = [start_char]

    for _ in range(max_length - 1):
        next_probs = bigram_probs[current_idx]
        next_idx = np.random.choice(len(char), p=next_probs)
        result.append(idx_to_char[next_idx])
        current_idx = next_idx

    return ''.join(result)

This function ties all the previous blocks together: pass it a starting character and the total number of characters we want, and it generates text one character at a time.

It works on conditional probability: the current character is given, and with its help we find the likely next character.

Bigram model:
P(next_char | current_char)
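
A quick way to try it (my addition; the seed and arguments are arbitrary):

np.random.seed(42)          # make the sampling repeatable
print(generation('t', 80))  # expect gibberish with locally plausible letter pairs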

But Here’s Where It Breaks

  1. Sequences get too long.
    For example, “I am understanding the concept”:
    a character-level tokenizer needs 30–40 tokens, whereas a sub-word tokenizer needs only 6–8.
    This is a problem because token counts explode inside transformers: attention cost is O(n^2), so ~40 character-level tokens means ~1,600 operations, while 6–8 sub-word tokens need only about 36–64 (see the sketch after this list).
  2. The model must build meaning from scratch.
    To understand “Computers”, a character-level model has to crawl through c → co → com → comp → …
    Extremely inefficient.
  3. Long-range dependencies.
    Take the text “the cat on the mat was hungry”:
    to make sense of “was hungry” (who was hungry? the cat), the model must still remember “the cat” several tokens later.
    That requires memory, and a character bigram model simply doesn’t have it.
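
If you want to check the sequence-length numbers yourself, here is a small sketch (my addition; it requires the tiktoken package, and the exact sub-word count depends on the vocabulary):

import tiktoken

sentence = "I am understanding the concept"

enc = tiktoken.get_encoding("gpt2")   # GPT-2's BPE vocabulary
char_tokens = list(sentence)
subword_tokens = enc.encode(sentence)

print(len(char_tokens), "char tokens ->", len(char_tokens) ** 2, "attention ops")
print(len(subword_tokens), "subword tokens ->", len(subword_tokens) ** 2, "attention ops")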

So, What Actually Fixed This?

These weren’t small inconveniences. The attention cost alone made character-level tokenization a dead end.

Then came Byte Pair Encoding. But BPE was not invented for NLP at all.

Philip Gage invented it in 1994 for file compression. The idea was simple: find the most frequent pair of bytes, replace it with one unused byte and repeat. Smaller file, same information.

In 2015, Sennrich et al. looked at that and thought, what if we do this with characters in language? Instead of compressing files, build a vocabulary. It worked.

How does it actually work?

Let’s take a small corpus, same way we did before:

"low low low wide new"

Start with individual characters:

l o w
l o w
l o w
w i d e
n e w

Now count every adjacent pair:

(l, o) → 3 ← tied for most frequent
(o, w) → 3 ← tied for most frequent
(w, i) → 1

Merge one of the most frequent pairs (ties are broken arbitrarily; here we pick o + w, which becomes ow):

l ow
l ow
l ow
w i d e
n e w

Count again. Now (l, ow) appears 3 times. Merge it:

low
low
low
w i d e
n e w

We keep doing this until we hit our vocabulary size limit. That limit is something we set ourselves. GPT-2 set it at 50,257.

The result is sub-words. Not full words, not single characters. Chunks like “un”, “ing”, “tion” that appear frequently enough to deserve their own token.

BPE never decides what makes sense. It just merges what appears most often. And somehow that produces meaningful subwords anyway.
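
Here is a minimal sketch of that merge loop (my own simplified version in the spirit of Sennrich et al., not production code; words are stored as space-separated symbols with their frequencies):

import re
from collections import Counter

def get_pair_counts(words):
    # Count adjacent symbol pairs across all words, weighted by word frequency.
    counts = Counter()
    for symbols, freq in words.items():
        syms = symbols.split()
        for a, b in zip(syms, syms[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(words, pair):
    # Replace every occurrence of the pair (as whole symbols) with its merge.
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    joined = ''.join(pair)
    return {pattern.sub(joined, symbols): freq for symbols, freq in words.items()}

words = {'l o w': 3, 'w i d e': 1, 'n e w': 1}

for _ in range(2):  # two merges, matching the walkthrough above
    pair = get_pair_counts(words).most_common(1)[0][0]
    words = merge_pair(words, pair)
    print(pair, '->', words)

Note the tie: (l, o) and (o, w) both appear three times, so this sketch happens to merge (l, o) first while the walkthrough merged (o, w). Either order ends at the same final tokens.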

Where BPE is today

GPT-2, GPT-3, and GPT-4 use tiktoken, which is BPE-based. LLaMA and Mistral use SentencePiece (in its BPE mode). BERT uses WordPiece, the same idea with a slightly different merge rule.

Every time you type into ChatGPT, your words are being silently chopped into subwords by logic borrowed from a 1994 file compression paper.

But it is not perfect

Vocabulary size is a guess. Too small and rare words get over-split. Too large and the embedding table bloats. There is no perfect number, just tuning.

It is not semantically aware. BPE does not know that “unhappy” and “not happy” mean the same thing. It only sees character patterns.

And it struggles with non-English languages: Hindi, Arabic, Chinese. A single Hindi word can explode into many tokens. This is still an active research problem today.
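
You can see this directly (my addition; the exact counts depend on the vocabulary, and this is just one example Hindi word):

import tiktoken

enc = tiktoken.get_encoding("gpt2")
for word in ["happiness", "खुशी"]:   # English vs. the Hindi word for happiness
    print(word, "->", len(enc.encode(word)), "tokens")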

What I Actually Learned Building This

Starting from zero, no libraries, no shortcuts, I felt every limitation of character-level tokenization personally. Sequences too long. Context evaporating. Meaning rebuilt from scratch every single time.

That frustration is what made BPE click for me.

BPE didn’t make models understand language. It solved the infrastructure problem that was blocking everything. Right sized chunks, not too small, not too large, gave transformers a real fighting chance.

Sometimes the best ideas just need the right problem to find them.


