Why every English tokenizer butchers Italian, the encoding switch that wasted my first attempt, and the regex that keeps “dell’algoritmo” in one piece.
Fabio Angeletti — PhD in Computer Engineering (Sapienza), Adjunct Professor at LUISS and LUISS Business School, Founder & CEO of LEAF. This is Article 3 of a series documenting the full engineering journey of Dante-2B. Read Article 2: “Assembling 450 Billion Tokens: The Training Data Nobody Had Ready.”
A tokenizer’s job sounds simple: split text into pieces the model can process. You could explain it in two minutes to a business student. I know — I do it every semester at LUISS.
But for Italian, this seemingly mundane task hides a cultural problem that most English-speaking AI researchers don’t even know exists. And getting it wrong doesn’t just waste efficiency — it structurally limits what the model can learn.
This article is about how I built a custom tokenizer for Dante-2B, why I had to throw away my first version, and the three engineering decisions that made the difference.
Why English Tokenizers Fail at Italian
The apostrophe problem
In English, apostrophes mark possessives or contractions: “it’s,” “don’t,” “Sarah’s.” They’re grammatically optional — you could rewrite any sentence without them.
In Italian, apostrophes are elisions — they mark where two words fuse into a single grammatical unit. “L’intelligenza” means “the intelligence.” “Dell’algoritmo” means “of the algorithm.” “Un’ottimizzazione” means “an optimization.” The apostrophe connects. Remove it, and you’ve broken the syntax.
Every major English tokenizer — GPT’s, LLaMA’s, Mistral’s — treats apostrophes as split points. They were designed for English, where that’s the right behavior. But when you feed them Italian text, “dell’algoritmo” becomes three separate tokens: [“dell”, “’”, “algoritmo”]. The model sees a broken article, a punctuation mark, and a noun — when an Italian reader sees a single, inseparable phrase.
This isn’t just an efficiency problem. When the apostrophe lands in a different token from both the article and the noun, the model’s attention mechanism has to work harder to learn that these three pieces form one grammatical unit. Multiply that across every elision in every Italian sentence, and you’re systematically handicapping the model’s ability to learn Italian syntax.
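You can see the split for yourself. Here is a quick probe using tiktoken’s GPT-2 encoding (any English-trained BPE behaves similarly; tiktoken is just a convenient stand-in, not the tool I used in the project):
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2's English-trained byte-level BPE
pieces = [enc.decode([t]) for t in enc.encode("dell'algoritmo")]
print(pieces)  # the elision reaches the model as disconnected fragments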
The accent problem
Italian uses six accented vowels in daily writing: à, è, é, ì, ò, ù. The word “perché” (why/because) appears in virtually every Italian text. So does “è” (is), “più” (more), “già” (already), “così” (so).
In a byte-level tokenizer — the standard approach used by most modern models — each of these characters is encoded as two bytes. The letter “è” becomes bytes 0xC3 and 0xA8. Unless the BPE algorithm sees enough instances to merge those two bytes into a single token, the model processes every accented character as two meaningless byte fragments.
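You can verify the two-byte encoding in any Python shell, no libraries needed:
print("è".encode("utf-8"))        # b'\xc3\xa8'
print(list("è".encode("utf-8")))  # [195, 168], decimal for 0xC3, 0xA8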
I learned this the hard way.
The First Tokenizer: What Went Wrong
My first tokenizer version used ByteLevel encoding — the same approach as GPT-2 and most LLaMA-family models. It maps every character to its UTF-8 byte representation, then lets BPE merge from there.
The results looked wrong immediately. The fertility analysis showed accented characters displayed as Ã² and Ã³ — byte-level artifacts, not actual Italian characters. The vocabulary contained only 23 tokens with Italian accented characters. Zero Italian apostrophe tokens were found.
The Italian fertility — the ratio of tokens to words — was 2.04. For comparison, LLaMA’s Italian fertility is approximately 1.85. My custom tokenizer was worse than the thing I was trying to beat.
But there was a subtlety. I’d trained this tokenizer on only 100,000 documents — roughly 2.6 GB of text. That’s far too little for a 64,000-token BPE vocabulary. The algorithm hadn’t seen enough Italian text to learn useful merges. Common words like “implementazione” were split into five fragments instead of being learned as a single token.
Should I have just trained on more data and stuck with ByteLevel?
No. The architectural problem was deeper than data quantity. ByteLevel encoding forces the BPE to waste merge budget re-learning that 0xC3 + 0xA8 = è. Every accented character costs a merge slot that could have been used for a real Italian word. With 64,000 vocabulary entries and six accented vowels appearing millions of times, that's thousands of merges wasted on byte-level reconstruction.
I needed a different encoding strategy entirely.
The Switch: From ByteLevel to Metaspace
The fix was switching from ByteLevel encoding to Metaspace — a Unicode-native approach. Instead of converting everything to bytes, Metaspace works with actual Unicode characters. Spaces become ▁ markers (that's U+2581, not an underscore). Accented characters remain accented characters. The BPE algorithm operates on real linguistic units from the start.
# ByteLevel: "è" → [0xC3, 0xA8] → two byte tokens
# Metaspace: "è" → "è" → one character token
from tokenizers import Tokenizer, Regex, models
from tokenizers.pre_tokenizers import Sequence as PreSequence, Metaspace, Digits, Split

tokenizer = Tokenizer(models.BPE(unk_token="<|unk|>"))
tokenizer.pre_tokenizer = PreSequence([
    Metaspace(replacement="▁", prepend_scheme="first"),
    Digits(individual_digits=True),
    Split(pattern=Regex(CUSTOM_PRETOK_REGEX), behavior="isolated", invert=False),  # regex defined below
])
There’s a trade-off. ByteLevel can represent any byte sequence — it never produces unknown tokens. Metaspace has no byte fallback: any character not in the vocabulary becomes UNK. This is why the unk_token parameter is required. In practice, this is a non-issue: the initial alphabet (which I'll describe below) covers all characters that actually appear in Italian, English, and code text. But it's a design decision worth understanding.
The pipeline order matters. Metaspace runs first, converting spaces to ▁. Then Digits splits numbers into individual digits (so "2024" becomes four tokens: "2", "0", "2", "4" — better for numerical reasoning). Then the custom regex splits the text into linguistic units. If you reverse the order, the regex can't match ▁ because spaces haven't been converted yet.
Before BPE, a normalizer cleans the input with NFKC normalization — which converts decomposed Unicode characters (where “à” is stored as “a” + combining grave accent) into their composed form (a single “à” character). Without this, the same visual character could have two different byte representations, confusing the BPE.

from tokenizers.normalizers import Sequence as NormSequence, NFKC, Strip

tokenizer.normalizer = NormSequence([
    NFKC(),   # decomposed à → composed à
    Strip(),  # trim whitespace
])
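A quick way to see what NFKC buys you, using only the standard library:
import unicodedata

decomposed = "a\u0300"  # "à" as 'a' + combining grave accent (two code points)
composed = "\u00e0"     # "à" as a single precomposed code point
assert decomposed != composed                                 # different bytes, same glyph
assert unicodedata.normalize("NFKC", decomposed) == composed  # NFKC unifies them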
The Italian Regex: Keeping Apostrophes Together
This is the core innovation. A custom pre-tokenization regex that understands Italian elision:
CUSTOM_PRETOK_REGEX = (
    r"['’]s|['’]t|['’]re|['’]ve|['’]m|['’]ll|['’]d"  # English contractions first
    r"|▁?\p{L}+(?:['’]\p{L}+)*"  # Italian elisions: dell'algoritmo stays whole
    r"|▁?\p{N}+"  # Numbers
    r"|▁?[^\s\p{L}\p{N}▁]+"  # Punctuation / symbols
    r"|▁"  # Standalone space marker
)
The trick is in that second line: \p{L}+(?:['’]\p{L}+)* matches a letter sequence optionally followed by one or more apostrophe-letter groups. "Dell'algoritmo" matches as a single unit. "L'implementazione" matches as one unit. "Un'ottimizzazione" — one unit.
The first line handles English contractions: “’s”, “’t”, “’re”, “’ve”, “’m”, “’ll”, “’d”. These are matched first because they appear earlier in the regex alternation, so English possessives and contractions still split correctly. Both ASCII apostrophes (') and curly apostrophes (’) are handled — because real text contains both, and missing one silently breaks the other.
Why does this matter for the model? Because when “dell’algoritmo” is a single BPE input unit, the algorithm can learn to represent it as one or two tokens — “dell’algoritmo” or “dell’” + “algoritmo” — based on frequency. When it’s pre-split into “dell” + “’” + “algoritmo”, the model never has the option of keeping the elision together. The regex gives BPE the choice. Without it, the choice is made for BPE — and made wrong.
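With the pipeline assembled as above, you can sanity-check the behavior directly. pre_tokenize_str shows the units BPE will see, before any merges (the test sentence is illustrative):
units = tokenizer.pre_tokenizer.pre_tokenize_str("L'algoritmo dell'intelligenza è qui")
print([u for u, _ in units])
# Elisions like "▁dell'intelligenza" should survive as single units.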

Pre-Seeding the Accent Alphabet
The Metaspace switch solved the encoding problem, but I added a second guarantee. The BPE trainer’s initial_alphabet parameter lets you specify characters that must always exist as individual tokens in the base vocabulary — regardless of their frequency in the training data.
I seeded the alphabet with every accented character that appears in Italian, pan-European languages, and common symbols:
initial_alphabet = list(set(
    "àèéìòùÀÈÉÌÒÙ"        # Italian accented vowels (core)
    "áíóúâêîôûäëïöü"       # Pan-European
    "çñßÇÑ"                # Romance/Germanic
    "ãõœæøåÃÕŒÆØÅ"         # Extended European
    "▁"                    # Metaspace marker
    "€£¥$@#§°©®™±×÷"       # Currency and symbols
    "\u2013\u2014\u2018\u2019\u201C\u201D\u2026"  # En/em dashes, curly quotes, ellipsis
))
This guarantees that è is always a single token. Without seeding, the BPE might not encounter enough instances of a rare accented character (say, Ù) to give it a vocabulary entry — and since Metaspace has no byte fallback, the character would simply become UNK. Seeding eliminates this risk entirely.
The alphabet also includes the ▁ Metaspace marker itself — which must be in the alphabet for the pre-tokenizer to function correctly — and typographic characters like em dashes, curly quotes, and the euro sign that appear frequently in Italian text but might not reach the merge threshold in a code-heavy training subset.
Character-Balanced Sampling: The Subtle Mistake
Before training the tokenizer, I needed to prepare a balanced subset of the corpus. BPE allocates its merge budget proportionally to character frequency — whichever language has more characters in the training file gets more vocabulary entries.
My first subset preparation script balanced by document count: 45% Italian documents, 45% English, 10% code. This sounds right but it’s wrong. Italian documents average fewer characters than English ones (Italian uses more affixes and shorter articles, English uses more compound phrases and longer sentences). Balancing by document count produced a character distribution of roughly 40% Italian and 50% English — the opposite of what I wanted.
The v2 script balances by character count. It estimates the average character length per dataset, then calculates how many documents to sample from each to hit the target character budget:
# Language mix — by CHARACTER COUNT, not document count
LANGUAGE_MIX = {
    "it": 0.45,
    "en": 0.45,
    "code": 0.10,
}
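The core of the allocation is a few lines. A minimal sketch, assuming average document lengths have already been estimated per dataset (the budget and lengths below are illustrative, not the real corpus statistics):
TARGET_CHARS = 10_000_000_000  # illustrative total character budget

def docs_to_sample(lang: str, avg_chars_per_doc: float) -> int:
    """Documents to draw so this language hits its character share."""
    char_budget = TARGET_CHARS * LANGUAGE_MIX[lang]
    return round(char_budget / avg_chars_per_doc)

# Italian documents are shorter on average, so they need MORE documents
# to reach the same 45% character share:
print(docs_to_sample("it", avg_chars_per_doc=2_500))
print(docs_to_sample("en", avg_chars_per_doc=3_500))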
There’s also a quality weighting layer. Higher-quality datasets within each language get more representation. FineWeb-Edu and FineWiki get weight 4.0, while Italian-PD gets 1.5. This means the BPE learns more merges from clean, formal text than from OCR-noisy public domain books.
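The weighting itself reduces to normalizing weights within a language group. A sketch (the weight values come from this article; the helper and the pairing of datasets into one group are illustrative):
IT_WEIGHTS = {"finewiki": 4.0, "italian-pd": 1.5}

def dataset_share(name: str, weights: dict[str, float]) -> float:
    """Fraction of the language's character budget this dataset receives."""
    return weights[name] / sum(weights.values())

print(dataset_share("italian-pd", IT_WEIGHTS))  # ≈ 0.27, vs ≈ 0.73 for FineWiki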
A third version added dataset size caps — because my initial allocation was requesting 63 million documents from a dataset that only contained 1.4 million. The surplus budget gets redistributed to other datasets in the same language group automatically.
The min_frequency Decision
BPE works by counting pair frequencies in the training data and merging the most common pair at each step. The min_frequency parameter sets a threshold: any pair that appears fewer than N times is ignored.
I set it to 5 (up from 2 in my first attempt). Why?
Web-scale text is noisy. Misspellings, OCR errors, random character sequences from corrupted documents — these produce low-frequency byte pairs that the BPE dutifully merges into useless tokens. With min_frequency=2, a typo that appears twice gets a vocabulary entry. With min_frequency=5, it needs to appear five times — which filters out most noise while preserving legitimate rare words.
The trade-off: some legitimate rare Italian words might not get their own tokens. A technical term that appears only three times in the training subset will be split into subwords. But at 64,000 vocabulary entries, the subword fallback is graceful — the term is split into meaningful pieces, not random bytes.
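Putting the training parameters together, here is a sketch of the trainer configuration as described in this article (SPECIAL_TOKENS is the 36-entry list from the next section; the training-file path is a placeholder):
from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(
    vocab_size=64_000,
    min_frequency=5,                # pairs rarer than this never merge
    initial_alphabet=initial_alphabet,
    special_tokens=SPECIAL_TOKENS,  # the 36 reserved tokens (next section)
)
tokenizer.train(["subset.txt"], trainer=trainer)  # path is a placeholder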
36 Special Tokens: Future-Proofing the Vocabulary
The first 36 token IDs are reserved for special tokens — not learned from data, but hardcoded for specific functions:
Core tokens (IDs 0–5): BOS (begin of sequence), EOS (end of sequence), PAD, UNK, SEP, MASK. The essentials that every language model needs.
Chat and instruct tokens (IDs 6–11): Start/end header markers, end-of-turn, system/user/assistant role markers. These enable the chat formatting that Phase 3 (supervised fine-tuning) will use. I reserved them from the start so the tokenizer doesn’t need to change between pretraining and fine-tuning.
Thinking tokens (IDs 12–13): <think> and </think>. If I ever want the model to reason in a scratchpad before answering — similar to how Claude and o1 handle chain-of-thought — the tokens are already there.
Code and tool tokens (IDs 14–19): Code block markers and tool-use tokens. Future-proofing for agentic capabilities.
Expert routing tokens (IDs 20–35): Sixteen <|expert_N|> tokens reserved for potential Mixture of Experts conversion. If I ever scale Dante to a larger, sparse architecture, the tokenizer is ready.
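Laid out as code, the ID layout looks like this. A sketch: <|begin_of_text|>, <|end_of_text|>, <|unk|>, the thinking tokens, and the expert tokens are named in this article; the remaining spellings are placeholders of my own:
SPECIAL_TOKENS = (
    ["<|begin_of_text|>", "<|end_of_text|>", "<|pad|>",
     "<|unk|>", "<|sep|>", "<|mask|>"]              # IDs 0–5: core
    + [f"<|chat_{i}|>" for i in range(6)]           # IDs 6–11: placeholder names
    + ["<think>", "</think>"]                       # IDs 12–13: reasoning scratchpad
    + [f"<|tool_{i}|>" for i in range(6)]           # IDs 14–19: placeholder names
    + [f"<|expert_{i}|>" for i in range(16)]        # IDs 20–35: MoE routing
)
assert len(SPECIAL_TOKENS) == 36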
Is this over-engineering? Maybe. But adding special tokens after training is messy — you either resize the embedding matrix (which can destabilize the model) or waste regular vocabulary entries. Reserving 36 entries out of 64,000 costs almost nothing, and it means I never have to retrain the tokenizer for architectural changes.
The Quality Check
After training, an automated quality check verifies five things:
Fertility. The ratio of tokens to words on standard Italian, English, and code test sentences. Targets: Italian below 1.40 (LLaMA is ~1.85), English below 1.30 (LLaMA is ~1.20), code in the 2.5–3.5 range.
Accent encoding. Each of the six Italian accented vowels (à, è, é, ì, ò, ù) is encoded individually and checked: is it a single token or was it split? After the Metaspace switch and alphabet seeding, all six encode as single tokens.
Apostrophe tokens. The check scans the vocabulary for tokens starting with common Italian elision patterns: ▁l', ▁dell', ▁un', ▁nell', ▁sull', ▁all'. If none are found, the Italian regex isn't working.
UNK check on code characters. Braces, brackets, operators, semicolons, backticks — all the punctuation that code requires. If any produce UNK tokens, the model can’t process code.
Special token IDs. Verify that <|begin_of_text|> is ID 0, <|end_of_text|> is ID 1, and so on. A misordered special token would silently corrupt every training sequence.
# From check-tokenizer.py — the pass/fail logic
issues = []
if avg_it > 1.85:
    issues.append("Italian fertility WORSE than LLaMA (~1.85)")
elif avg_it > 1.40:
    issues.append("Italian fertility above target but better than LLaMA")
if not italian_apo:
    issues.append("No Italian apostrophe tokens found — regex may not be working")
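For reference, the fertility number that drives the pass/fail logic is straightforward to compute. A minimal sketch (the test sentence is illustrative, not the script's actual test set):
def fertility(tok, sentences):
    """Average number of tokens per whitespace-delimited word."""
    n_tokens = sum(len(tok.encode(s).ids) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / n_words

avg_it = fertility(tokenizer, ["L'implementazione dell'algoritmo è già qui."])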
The Result
The final tokenizer has 64,000 vocabulary entries. Italian fertility well below LLaMA’s 1.85 benchmark. All accented characters encode as single tokens. Italian apostrophe elisions are preserved in the vocabulary. All code characters handled without UNK.
What does this mean in practice? When Dante-2B processes Italian text, it sees more content per token than a LLaMA-based model. The effective context window for Italian text is 30–40% larger. And the model’s attention mechanism can learn Italian syntax directly, without first having to reconstruct elisions from scattered token fragments.
The tokenizer is one file — tokenizer.json — and it will be released as part of the open-source package on GitHub and HuggingFace along with the training script, the subset preparation script, and the quality check.
What I’d Do Differently
Test with the full dataset from day one. My first tokenizer attempt with ByteLevel encoding was trained on 100K documents. The terrible fertility numbers were partly a data quantity issue and partly an architectural issue, and I couldn’t separate the two. If I’d trained both ByteLevel and Metaspace versions on the full subset from the start, the comparison would have been clean and I would have saved a week.
Add language detection to the subset preparation. My character-balanced sampling assumes that documents in the it/ directory are actually Italian. Some web crawl data contains mislabeled documents — English text in the Italian folder, or machine-translated text that's technically Italian but linguistically garbage. A fast language detection pass (even a simple heuristic based on Italian function words) would have caught these before they contaminated the tokenizer training.
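The heuristic can be crude and still catch gross mislabeling. A toy sketch (the word list and threshold are illustrative):
IT_FUNCTION_WORDS = {"il", "la", "di", "che", "è", "per", "con", "non", "una", "sono"}

def looks_italian(text: str, threshold: float = 0.05) -> bool:
    """Flag a document as Italian if enough of its words are Italian function words."""
    words = text.lower().split()
    if not words:
        return False
    return sum(w in IT_FUNCTION_WORDS for w in words) / len(words) >= threshold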
The Takeaway
If you’re evaluating an Italian AI model — or any non-English AI model — ask about the tokenizer. Not just the parameter count, not just the training data volume. Ask: what tokenizer does it use? Was it designed for this language, or adapted from English?
A model with a native tokenizer processes its target language 30–40% more efficiently than a model using a borrowed English tokenizer. That’s not a small margin. It means more context per API call, better syntax learning, and lower inference costs. It’s one of the few genuine technical advantages a smaller, focused model can have over a larger multilingual one.
The tokenizer is the foundation. Everything the model learns passes through it. If the foundation breaks Italian into meaningless fragments, the model will spend its entire training budget trying to reassemble them — instead of learning the language.
Next: Article 4 — Designing a 2-billion parameter architecture, and the head dimension that silently cost me 20% of my GPU performance.
I’m Fabio — with LEAF I bring emerging technologies to businesses before they go mainstream. At LUISS and LUISS Business School, I teach deep tech to the people who won’t build these technologies, but will decide whether to adopt them.