Ten datasets. Three languages. Broken APIs, nested fields, and giant books that didn’t fit in my pipeline. The unglamorous foundation of everything that follows.
Fabio Angeletti — PhD in Computer Engineering (Sapienza), Adjunct Professor at LUISS and LUISS Business School, Founder & CEO of LEAF. This is Article 2 of a series documenting the full engineering journey of Dante-2B. Read Article 1 here
Why I’m Training an Italian Language Model from Scratch — With Two GPUs and No Funding
Before you can train a language model, you need data. This is the part of AI that nobody talks about at conferences. The papers mention “a corpus of X billion tokens” in a single sentence, then spend twenty pages on the architecture. As if the data just appeared.
It doesn’t.
For an Italian-focused model, the problem is worse. English training data is abundant — Common Crawl, The Pile, RedPajama, FineWeb — curated, deduplicated, ready to use. For Italian, you’re assembling your own dataset from scratch, dealing with broken APIs, inconsistent formats, and the uncomfortable question of whether the data you found is actually good enough.
This article is about how I built the training corpus for Dante-2B. The sources, the failures, the engineering decisions, and the system I built to keep it all from falling apart.
The Shopping List
I needed roughly 450 billion tokens, split approximately 45% Italian, 45% English, and 10% code. Not because Phase 1 would consume all of them — the plan was 90 billion tokens for base pretraining — but because I wanted headroom. Enough data for multiple passes without the model memorizing everything, and enough diversity to cover different registers of Italian: web text, formal legal language, literary prose, parliamentary debate, encyclopedic writing.
Here’s what I assembled — ten datasets, each with its own personality and its own problems.

Italian sources
FineWeb-2 Italian — my primary Italian web source, and the single largest dataset in the corpus at an estimated 150 billion tokens. This is HuggingFace’s cleaned web crawl, globally deduplicated using MinHash. The quality is high because the deduplication happens across the entire crawl, not just within individual dumps. This matters: without cross-crawl dedup, the same article copied across a hundred websites counts a hundred times. The model memorizes it instead of learning from it.
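MinHash makes deduplication tractable by comparing cheap fixed-size signatures instead of full texts. This is a toy sketch of the idea, not FineWeb-2's actual pipeline (which shingles and bucket-hashes at crawl scale); the shingle size and hash count are illustrative:

```python
import hashlib

def minhash_signature(text, num_hashes=64, shingle_size=5):
    """MinHash over character shingles: the minimum hash per seed acts
    like a random sample of the shingle set, so two documents' signatures
    agree roughly in proportion to their Jaccard overlap."""
    shingles = {text[i:i + shingle_size]
                for i in range(len(text) - shingle_size + 1)}
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # one salted hash family per slot
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "big")
            for s in shingles))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates set similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Near-duplicate pages land in the same similarity bucket even when a few words differ, which is exactly the copied-article-on-a-hundred-sites case.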
Cleaned mC4 Italian — a highly filtered version of the multilingual Common Crawl, contributing roughly 50 billion tokens. Think of it as a complementary web source to FineWeb-2: different filtering criteria, different extraction pipeline, different set of documents that survived the quality threshold. Having two independent web sources for the same language reduces the risk that a single pipeline’s blind spots become the model’s blind spots.
Italian-PD — 171,000 Italian public domain books. Roughly 13 billion words. From Manzoni to obscure legal treatises from the 1800s. Roughly 32 billion tokens after tokenization — over twice the initial estimate, because public domain books are denser than web text. This dataset gave me more headaches than all the others combined. I’ll get to that.
Gazzetta Ufficiale — Italy’s Official Gazette, approximately 5 billion tokens. Every law, decree, regulation, and official notice published by the Italian government. This is the kind of data that makes an Italian model Italian, not just a model that happens to output Italian words. If you want the model to understand bureaucratic Italian, legal language, and the formal register that companies deal with daily — this is where it learns.
FineWiki Italian — Wikipedia with improved extraction compared to standard database dumps, around 3 billion tokens. Wikipedia is every model’s encyclopedia, and the Italian version is crucial for factual grounding.
FinePDFs Italian — high-quality documents extracted from PDF files, about 0.5 billion tokens. Smaller in volume but often higher in quality — academic papers, technical reports, official documents that don’t appear on the open web.
EuroParl Italian — European Parliament proceedings in Italian, roughly 0.5 billion tokens. Parliamentary debate has a specific cadence and vocabulary. It’s also one of the highest-quality parallel corpora available, which matters if you care about the model’s ability to handle formal Italian argumentation.
English sources
FineWeb-Edu — a 100 billion token sample filtered specifically for educational content. This dataset has become the gold standard in the open-source LLM community because models trained on it show massive improvements on reasoning benchmarks like MMLU and ARC. The filtering is aggressive — it selects text that looks like it was written to teach something, not just to exist on the internet.
FineWiki English — English Wikipedia with the same improved extraction pipeline as the Italian version, approximately 8 billion tokens.
Code
StarCoderData and its expanded variant — up to 50 million documents of multi-language source code, estimated at 40 billion tokens. Python, JavaScript, Java, C++, and dozens of other languages. The code component isn’t about making Dante-2B a coding model — it’s about teaching it structured reasoning. Code has strict syntax, clear logical flow, and formatting patterns (HTML, Markdown, JSON) that appear constantly in web text. Models trained with some code data consistently outperform code-free models on non-code tasks.
The Three Things That Went Wrong
Everything in the list above sounds clean when described in a paragraph. The reality was messier.
Problem 1: Italian-PD and the broken streaming API
Most datasets download smoothly. HuggingFace’s load_dataset with streaming mode handles them without drama — you point it at a dataset, it starts yielding documents, you write them to JSONL shards on disk. My pipeline uses a ShardWriter class that collects 500,000 documents per shard, checkpoints progress, and supports resumable downloads.
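The ShardWriter itself never appears in this article, but the contract it has to satisfy — buffer, flush atomically, record progress — fits in a few lines. A minimal sketch, with illustrative names rather than the real pipeline's code:

```python
import json
import os
from pathlib import Path

class ShardWriter:
    """Buffer documents, flush them as JSONL shards via atomic rename,
    and record progress so an interrupted run can resume."""

    def __init__(self, out_dir, docs_per_shard=500_000):
        self.out_dir = Path(out_dir)
        self.out_dir.mkdir(parents=True, exist_ok=True)
        self.docs_per_shard = docs_per_shard
        self.buffer = []
        # Resume from the checkpoint left by a previous run, if any
        meta = self.out_dir / "download_meta.json"
        self.shard_idx = (json.loads(meta.read_text())["next_shard"]
                          if meta.exists() else 0)

    def add(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= self.docs_per_shard:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        final = self.out_dir / f"shard_{self.shard_idx:05d}.jsonl"
        tmp = Path(str(final) + ".tmp")
        with open(tmp, "w", encoding="utf-8") as f:
            for doc in self.buffer:
                f.write(json.dumps(doc, ensure_ascii=False) + "\n")
        os.rename(tmp, final)  # atomic: the shard exists fully or not at all
        self.buffer = []
        self.shard_idx += 1
        (self.out_dir / "download_meta.json").write_text(
            json.dumps({"next_shard": self.shard_idx}))
```

Kill the process mid-shard and only a `.tmp` file is left behind; the checkpoint still points at the last completed shard.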
Italian-PD was different.
The streaming metadata on HuggingFace was broken. The standard approach — load_dataset("PleIAs/Italian-PD", streaming=True) — simply failed. No helpful error message. Just a hanging connection or a cryptic metadata parse error.
I spent an afternoon debugging before accepting the reality: I’d have to download the raw Parquet files directly via the HuggingFace API and convert them myself.
# When streaming doesn't work, go raw: fetch the Parquet files directly
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="PleIAs/Italian-PD",
    filename=parquet_file,
    repo_type="dataset",
    local_dir=output_dir,
)
The download was the easy part. The conversion was where things got interesting.
Parquet files are structured — they have named columns. You’d expect the text column to be called “text.” It wasn’t always. Some files used “content.” Some used “body.” I wrote a detection pipeline that tries standard names first, then falls back to heuristic detection: find the first string column whose sampled values run longer than 50 characters.
text_col = None
for candidate in ["text", "content", "body", "document"]:
    if candidate in columns:
        text_col = candidate
        break

# Fallback: first long string column
if text_col is None:
    for col in columns:
        vals = sample.get(col, [])
        if vals and isinstance(vals[0], str) and len(vals[0]) > 50:
            text_col = col
            break
Not elegant. But it worked on every Parquet file in the dataset.
And then: the giant shard problem. Italian-PD documents are big. With 13 billion words spread across 171,000 books, the average book runs to several hundred thousand characters — orders of magnitude larger than a web page. My standard shard size of 500,000 documents would have produced shards of hundreds of gigabytes each. I dropped it to 10,000 documents per shard — which still produced shards around 4 GB. This was a lesson in knowing your data before setting infrastructure parameters.
Problem 2: EuroParl and the nested text field
Most datasets store their text in a flat field: example["text"]. EuroParl doesn't. It's a parallel corpus — English and Italian are stored together — so the text lives inside a nested structure: example["translation"]["it"].
My download pipeline expected a simple text_field parameter. EuroParl needed a text_subfield parameter that I hadn't planned for:
"europarl_it": {
"text_field": "translation",
"text_subfield": "it", # Nested: example["translation"]["it"]
}A small fix in code, but the kind of thing that burns two hours of debugging when the pipeline silently produces empty documents instead of raising an error. The lesson: always validate a sample of your output, not just the pipeline’s exit code.
Problem 3: StarCoderData and the “content” field
Here’s one that almost slipped through. Every Italian and English dataset stores its text in a field called "text". StarCoderData uses "content". My tokenization script reads doc.get("text", "") — which would quietly return empty strings for every code document, tokenize nothing, and produce empty binary files.
I caught this during the health check (more on that below), when the code dataset showed suspiciously low token counts. The fix was adding code-specific field handling, but the root cause was an assumption I’d baked into the pipeline: that all datasets follow the same schema. They don’t.
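The cheap insurance against this class of bug is a loud sample check instead of a silent `doc.get("text", "")` default. A sketch, with an illustrative threshold:

```python
def validate_sample(docs, min_nonempty=0.95):
    """Fail loudly if a suspicious fraction of extracted documents are
    empty -- the classic symptom of a wrong field name."""
    nonempty = sum(1 for d in docs if d.get("text", "").strip())
    ratio = nonempty / max(len(docs), 1)
    if ratio < min_nonempty:
        raise ValueError(
            f"only {ratio:.0%} of sampled docs contain text; "
            "check the dataset's field names")
    return ratio
```

Run it on a few hundred documents per dataset right after download, and a schema mismatch surfaces in seconds instead of at the end of a tokenization run.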
The Quality Tier System
Not all data is created equal. A Wikipedia article about Renaissance architecture, a legal decree from the Gazzetta Ufficiale, and a random blog post scraped from the web are all “Italian text” — but they contribute very differently to a model’s capability.
I introduced a three-tier quality system that prefixes every tokenized binary file:
Tier 1 (t1) — Curated, filtered, highest quality. FineWeb-Edu, FineWiki (both languages), FinePDFs Italian, Gazzetta Ufficiale, EuroParl. These are the datasets where every document has been filtered for quality, extracted cleanly, or comes from an inherently high-quality source.
Tier 2 (t2) — Well-deduplicated web and code. FineWeb-2 Italian, Cleaned mC4 Italian, StarCoderData. Good data with solid deduplication, but less filtered than Tier 1.
Tier 3 (t3) — Bulk sources. Italian-PD (the public domain books). Massive in volume but noisier — OCR artifacts, archaic spelling, inconsistent formatting.
QUALITY_TIERS = {
    # Tier 1: educational, curated, wiki
    "finepdfs_it": 1, "fineweb_edu": 1, "finewiki_it": 1,
    "finewiki_en": 1, "gazzetta_ufficiale": 1, "europarl_it": 1,
    # Tier 2: well-deduplicated web, code
    "clean_mc4_it": 2, "fineweb2_it": 2, "starcoderdata": 2,
    # Tier 3: bulk sources
    "italian_pd": 3,
}

The tier prefixes serve a dual purpose. During training, a quality-aware data loader can sample from higher tiers more frequently — Tier 1 documents get seen 3x more often than Tier 3. But equally important, the tiers provide debugging visibility. When I tested the model mid-training and noticed it was generating oddly archaic Italian, I could immediately check: is the data loader over-sampling Tier 3? Are the Italian-PD shards dominating the batch?
Without tiers, “the model generates weird text” is a mystery. With tiers, it’s a diagnosis.
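Tier-weighted sampling itself is a few lines. A sketch — the article only fixes Tier 1 at 3x Tier 3, so the Tier 2 weight here is my assumption:

```python
import random

# Relative sampling weights per tier; tier 1 is 3x tier 3 as in the
# article, and the tier-2 value is an assumption for this sketch
TIER_WEIGHTS = {1: 3.0, 2: 2.0, 3: 1.0}

def sample_shard(shards, rng=random):
    """Pick the next shard to read, favoring higher-quality tiers.

    `shards` is a list of (tier, path) pairs, e.g. parsed from the
    t1_/t2_/t3_ filename prefixes.
    """
    weights = [TIER_WEIGHTS[tier] for tier, _ in shards]
    return rng.choices(shards, weights=weights, k=1)[0]
```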
tokenized_bin/
t1_fineweb_edu__shard_00000.bin ← Tier 1: educational
t1_gazzetta_ufficiale__shard_00003.bin
t2_fineweb2_it__shard_00042.bin ← Tier 2: web
t2_starcoderdata__shard_00010.bin
t3_italian_pd__shard_00007.bin ← Tier 3: bulk

Token Estimation: Why Character Count Matters
Here’s a subtlety that wasted more of my time than it should have.
When you’re assembling a corpus, you need to estimate how many tokens each dataset will produce — before you’ve trained the tokenizer. You can’t tokenize without a tokenizer. But you need token estimates to decide how much data to download, how to balance the language mix, and whether your training budget is feasible.
The naive approach: divide character count by 4 (the rough average across English text). This is wrong for a multilingual corpus.
Italian yields fewer characters per token than English. Italian words are longer on average — “implementazione” vs “implementation,” “rappresentazione” vs “representation” — and the accented characters add complexity, so tokenizers split them into more, shorter pieces. Code is denser still: short identifiers and compact syntax mean even more tokens per character.
I used per-language estimation ratios:
CHARS_PER_TOKEN = {
    "it": 3.3,    # Italian: longer words, accents
    "en": 4.0,    # English: standard ratio
    "code": 2.8,  # Code: dense, short identifiers
}

The difference between using a flat 4.0 ratio and these per-language ratios is significant. A flat ratio underestimates Italian tokens by roughly 18% and code tokens by roughly 30%. When your training budget is 90 billion tokens and you need to plan a two-week GPU run, an estimation error of that size is the difference between finishing on time and running out of data.
Resumability: The Feature That Pays for Itself
The full corpus download took 3–5 days depending on HuggingFace server load. During that time, things went wrong. Network interruptions. Server restarts. Disk space warnings at 3 AM. HuggingFace rate limits.
Every component in the pipeline was designed to be killed and restarted at any point.
Each dataset tracks its progress in a download_meta.json file. Each shard is checkpointed individually — if the script is killed after downloading 42 of 200 shards, the next run picks up at shard 43. Multiple datasets download simultaneously across parallel workers. The entire thing is a single command:
python 1-download-corpus.py --corpus-dir /dir1/llm_corpus --step all --workers 4
Kill it with Ctrl+C. Run the same command again. It resumes.
Every write to disk uses a temporary file with a .tmp extension, then atomic rename:
# Write to a temp file first
with open(tmp_path, "wb") as f:
    writer.write(docs)

# Atomic rename — if this succeeds, the shard is complete
os.rename(tmp_path, final_path)
Why? Because a crash at 80% completion — a full disk, a network glitch, a power blip — without atomic writes would leave a corrupted shard that looks complete but contains truncated data. With atomic writes, a crash leaves a .tmp file that the next run knows to delete and redo. The final file either exists in full or doesn't exist at all.
This pattern saved me twice during the Italian-PD conversion and once during the tokenization step.
This sounds obvious when I describe it. But I’ve seen data pipelines at companies — real companies with real budgets — that can’t do this. They download everything from scratch if anything goes wrong. I wanted something better, not because I’m idealistic, but because I was going to be the one waking up at 3 AM to fix it.
The Pre-Tokenization Step
Once the corpus was assembled and validated, every document needed to be converted from text to token IDs — the numerical format that the model actually consumes.
This is a CPU-intensive operation: read every JSONL shard, encode every document with the tokenizer (which I hadn’t trained yet at this point — that’s Article 3), append an EOS (end-of-sequence) token after each document, and write the result as a flat binary file of unsigned 16-bit integers.
Why uint16? Because my vocabulary is 64,000 tokens. The maximum token ID is 63,999, which fits in 16 bits (max value 65,535). Using 16 bits instead of 32 bits halves the disk footprint and doubles the data loading throughput — the training loop reads tokens from memory-mapped files, so smaller files mean faster I/O.
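A sketch of writing and memory-mapping a uint16 shard. The EOS id here is a placeholder — the real value comes from the tokenizer trained in Article 3:

```python
import numpy as np

EOS_ID = 63_999  # placeholder: the real id comes from the trained tokenizer

def write_token_shard(token_docs, path):
    """Flatten documents into one uint16 stream, appending an EOS
    token after each document so boundaries survive on disk."""
    flat = [t for doc in token_docs for t in (*doc, EOS_ID)]
    np.asarray(flat, dtype=np.uint16).tofile(path)

def open_token_shard(path):
    """Memory-map the shard; the OS pages tokens in on demand, so the
    training loop never loads a whole file into RAM."""
    return np.memmap(path, dtype=np.uint16, mode="r")
```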
The tokenization uses all available CPU cores via Python’s multiprocessing — every core except two, left for the operating system. Each worker processes one JSONL shard independently. No coordination needed between workers — each shard produces exactly one .bin file. The same atomic write pattern (.tmp → rename) protects against crashes.
A leftover .tmp file from an interrupted run is automatically cleaned up on the next launch:
for tmp in output_dir.glob("*.tmp"):
    tmp.unlink()
    print(f"  Cleaned up partial file: {tmp.name}")

Code datasets get a lower minimum character threshold (20 characters instead of 100) because short code files — a single function, a config snippet — are legitimate training data that shouldn’t be filtered out.
The Health Check
After everything was downloaded, converted, and tokenized, I needed proof that the corpus was intact. Not “it looks fine” — actual automated validation.
The health check scans every dataset and reports: total documents, total characters, estimated tokens, shard count, disk usage, and tier breakdown. It checks for EOS token presence in the binary files — if the tokenization step failed to insert document boundaries, the model would never learn when one document ends and another begins. It catches shard name collisions (two datasets producing files with the same name), leftover .tmp files from interrupted runs, and metadata inconsistencies.
Datasets: 10
Tokenized files: 816
Total tokens: ~447.8B
Disk usage: 834.0 GB
Tier 1: 130.7B (29%)
Tier 2: 285.1B (64%)
Tier 3: 32.0B (7%)
✅ No critical data quality issues found!
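The EOS-presence part of such a check can be sketched in a few lines, again with a placeholder EOS id:

```python
import numpy as np

EOS_ID = 63_999  # placeholder EOS id for the sketch

def check_eos(bin_path, sample_tokens=1_000_000):
    """Read the head of a tokenized shard and fail loudly if no
    document boundaries are present."""
    tokens = np.memmap(bin_path, dtype=np.uint16, mode="r")[:sample_tokens]
    eos_count = int((tokens == EOS_ID).sum())
    if eos_count == 0:
        raise ValueError(f"{bin_path}: no EOS tokens found; were "
                         "document boundaries ever inserted?")
    return eos_count
```

Sampling only the head of each file keeps the quick mode fast while still catching a tokenizer that never inserted boundaries at all.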
The health check runs automatically after every download. You can also run it independently:
python 1-download-corpus.py --step check --quick # sampled, 3 shards per dataset
python 1-download-corpus.py --step check # full scan, every shard
This is where I caught the StarCoderData "content" vs "text" field issue. The code dataset showed nearly zero tokens in the health check output. Without the automated check, I would have trained on a corpus that was silently missing 10% of its planned data.
What I’d Do Differently
Two things.
First: I’d profile the dataset sizes earlier. The Italian-PD giant-shard problem would have been obvious if I’d looked at average document size before setting my shard parameters. I assumed all datasets would have similar document sizes because all my previous experience was with web-crawl data. Books are not web pages. Know your data.
Second: I’d invest more in data quality filtering. The tier system is a good start, but it’s coarse. Within Tier 2, there’s a wide quality range — some FineWeb-2 documents are excellent, others are boilerplate cookie notices and navigation menus. More aggressive per-document filtering (language detection, perplexity scoring, near-duplicate removal within datasets) would have produced a cleaner training signal. I chose speed over perfection because I was building alone and needed to start training. For a production model, I’d spend another week on data quality.
The Takeaway for Non-Technical Readers
If you’re a decision maker thinking about AI for your business, here’s what this article means for you.
The quality of an AI model is bounded by the quality of its data. This isn’t a metaphor — it’s a mathematical fact. A model trained on noisy, duplicated, poorly structured text will produce noisy, repetitive, poorly structured output. The architecture can’t compensate for bad data.
When you evaluate AI solutions for your company, ask about the data. Not “how many parameters” — that’s the question everyone asks. Ask “what data was it trained on, and how was it cleaned?” The answer tells you more about the model’s real capabilities than any benchmark score.
Dante-2B’s corpus took weeks to assemble, validate, and process — and it’s the foundation everything else stands on.
Next: Article 3 — Training a tokenizer that actually speaks Italian. Why every English tokenizer butchers Italian, the regex hack that keeps apostrophe contractions intact, and the encoding decision that wasted my first attempt.
Training a Tokenizer That Actually Speaks Italian
I’m Fabio — with LEAF I bring emerging technologies to businesses before they go mainstream. At LUISS and LUISS Business School, I teach deep tech to the people who won’t build these technologies, but will decide whether to adopt them.
Assembling 450 Billion Tokens: The Training Data Nobody Had Ready was originally published in Towards AI on Medium.