Italy is building its own AI models. Big companies, supercomputers, government backing. I’m doing it alone, on a single server, and I think that’s exactly the point.
Fabio Angeletti — PhD in Computer Engineering (La Sapienza), Adjunct Professor at LUISS and LUISS Business School, Founder & CEO of LEAF. I bring emerging technologies to businesses before they go mainstream. This is the first article in a series documenting the full engineering journey of Dante-2B, a bilingual Italian/English language model trained from scratch.

Let me start with a confession.
I’ve failed three times with LEAF, my company. A legal research tool during COVID: the market didn’t care. A UV mask adapter: the pandemic ended overnight. Sigillum, an anti-counterfeit device: brilliant engineering, economics that didn’t add up (yet).
Three products. Zero revenue in year one.
Each failure taught me the same lesson from a different angle: the gap between a research paper and a product that someone will pay for is the most interesting place in the world. And the most dangerous.
I’m about to walk into that gap again. This time, with a language model.
Italy Is Already Building AI Models. So Why Am I Building Another One?
Let me acknowledge the obvious. Italy is not standing still.
iGenius built Modello Italia — a 9-billion parameter model trained from scratch on the Leonardo supercomputer at CINECA in Bologna, aimed at the public administration. Fastweb developed MIIA, a 7B model trained on 1.5 trillion tokens using Italy’s first enterprise-grade NVIDIA DGX system. Almawave released the Velvet family — models from 2B to 25B parameters, multilingual, with privacy-preserving mechanisms, under Apache 2.0 license. On the academic side, my own alma mater Sapienza produced Minerva and the original DanteLLM, while Bari gave us LLaMAntino and Pisa contributed Cerbero.
Italy has an AI ecosystem. It’s real, it’s growing, and some of these projects are genuinely excellent.
So why am I building yet another Italian model?
Because almost every one of these projects shares the same DNA — and the same blind spot.

The corporate models require corporate infrastructure. Modello Italia was trained on Leonardo, one of the most powerful supercomputers in Europe. Fastweb MIIA runs on a proprietary DGX cluster. These are serious capabilities, but they’re locked behind institutional access. A 50-person company in Brescia that wants to fine-tune an Italian model on their internal documents can’t call CINECA and book time on Leonardo.
The academic models are adaptations, not foundations. LLaMAntino is a fine-tune of Meta’s LLaMA. Cerbero adapts Mistral. Maestrale applies instruction tuning to an existing base. These are valuable — parameter-efficient fine-tuning is smart engineering. But the foundation is still an English model. The tokenizer is still English-first. The byte-level representation still breaks Italian accented vowels into meaningless fragments. You can teach LLaMA to speak Italian, but you can’t unteach it that Italian is a second language.
Nobody is showing the work. This is what bothers me most. Every Italian AI project announces results. Nobody publishes the loss curves, the failed experiments, the bugs that cost them a week. Nobody explains how to build a model from scratch in a way that another engineer could reproduce. The knowledge stays locked inside corporate R&D departments and academic labs. That’s the opposite of what an ecosystem needs to grow.
What I’m Actually Building — and What Makes It Different
Dante-2B is a 2.1-billion parameter language model trained entirely from scratch — not fine-tuned from LLaMA, not adapted from Mistral, not distilled from a larger model. Every weight starts from random noise and learns from data.
Four things make it different from everything else in the Italian landscape.
First: the tokenizer is Italian-native. This sounds like a detail. It’s not. Every model I just listed — whether trained from scratch or fine-tuned — uses a tokenizer that was either designed for English or designed for “multilingual” (which really means “English plus whatever else fits”). Italian apostrophe contractions get split into three pieces. Accented vowels get encoded as two bytes. The model wastes capacity re-learning basic Italian orthography.
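The two-byte problem is easy to verify directly. A minimal check (plain Python, just to illustrate the byte-level fragmentation described above):

```python
# Every Italian accented vowel occupies two bytes in UTF-8, so a
# byte-level tokenizer with no learned merge for them sees two opaque
# fragments where an Italian reader sees one letter.
for ch in "àèéìòù":
    print(ch, "→", len(ch.encode("utf-8")), "bytes")  # 2 bytes each, vs 1 for plain ASCII vowels
```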
I built a custom 64,000-token BPE tokenizer with three Italian-specific design choices: an Italian-aware pre-tokenization regex that keeps “dell’algoritmo” as a single unit; pre-merged accented characters, so “è” is always one token; and character-balanced training data (not document-balanced — a subtle but critical distinction I’ll explain in Article 3). The result: Dante-2B processes Italian text significantly more efficiently than LLaMA-based models, which means more content per token, which means a larger effective context window for Italian text.
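To make the first two choices concrete, here is a sketch of what an apostrophe-keeping pre-tokenizer plus accent pre-merging can look like. The regex and helper below are illustrative stand-ins, not the actual Dante-2B code:

```python
import re
import unicodedata

# Illustrative Italian-aware pre-tokenization rule: a word may carry an
# elided article ("dell'", "l'", "un'") attached via apostrophe, instead
# of being split at the apostrophe the way English-first rules do.
ITALIAN_AWARE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def pre_tokenize(text: str) -> list[str]:
    # NFC normalization folds "e" + combining grave accent into the
    # single code point "è", so BPE always sees one canonical form.
    text = unicodedata.normalize("NFC", text)
    return ITALIAN_AWARE.findall(text)

print(pre_tokenize("L'analisi dell'algoritmo è completa."))
# → ["L'analisi", "dell'algoritmo", "è", "completa", "."]
```

With this pre-segmentation, “dell’algoritmo” reaches the BPE trainer as one unit and can become one learned token, rather than three fragments split at the apostrophe.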

Second: it runs on two GPUs. Not a supercomputer. Not a DGX cluster. Two server-grade GPUs on a single machine, with about 140 GB of memory each. Total training time: about two weeks for Phase 1, plus another few days for context extension.
This matters because it proves a point. You don’t need CINECA to build a useful Italian model. You don’t need government funding or a telecom company’s hardware budget. A single engineer with the right architecture choices, the right training pipeline, and enough stubbornness can produce a foundation model on hardware that a mid-size company could actually afford.
If the minimum viable path to an Italian AI model requires a supercomputer, then Italian AI will always depend on a few gatekeepers. If it can be done on two GPUs, the entire game changes.
Third: every step is documented. The corpus assembly. The tokenizer design. The architecture decisions — including the wrong ones. The bugs. The failed optimizations. The checkpoint nightmares. This article series is the documentation that I wished existed when I started.
Fourth: everything will be released open-source. The model weights on HuggingFace. The full training code, data pipeline, tokenizer, and every script on GitHub. Not “open-weight with a restrictive license” — actually open, so anyone can reproduce, fine-tune, or build on top of what I’ve done.
This isn’t generosity. It’s a bet.
The Italian AI ecosystem doesn’t need another corporate model behind an enterprise agreement. It needs a reproducible path that a startup in Naples, a research group in Torino, or a developer in Palermo can actually follow. If Dante-2B becomes the starting point for ten other Italian models built by people I’ve never met, that’s a better outcome than any product I could build alone.

The Numbers — Honest and Uncomfortable
Let me be transparent about the scale, because I think honesty about constraints is more useful than pretending they don’t exist.

Dante-2B trains on a fraction of the tokens that Minerva or MIIA consumed. That’s the honest truth.
But here’s the nuance. Those trillion-token budgets include massive amounts of English data. Minerva trained on 2.5 trillion tokens, but how much of that was Italian? The published description mentions Italian, English, and code — let’s generously assume 40% Italian, which gives roughly one trillion Italian tokens. Dante-2B dedicates about 45 billion tokens exclusively to curated Italian data. That’s roughly a 20x gap on Italian data alone — significant, but far smaller than the 55x gap the raw headline totals (2.5 trillion vs. 45 billion) suggest.
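The back-of-envelope comparison above, as a worked calculation (the 40% Italian share is the article’s generous assumption, not a published figure):

```python
minerva_total = 2.5e12        # Minerva's reported total training tokens
italian_share = 0.40          # assumed Italian fraction (Italian + English + code)
minerva_italian = minerva_total * italian_share   # ≈ 1.0e12 Italian tokens

dante_italian = 45e9          # Dante-2B's curated Italian token budget

print(minerva_italian / dante_italian)  # ≈ 22, i.e. roughly a 20x gap
```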
And Dante-2B’s Italian data isn’t scraped web noise. It’s the Gazzetta Ufficiale, European Parliament proceedings, 171,000 Italian public domain books, FineWiki, and the best-filtered Italian web crawl available. Quality over quantity — the same principle that makes a specialized boutique competitive against a department store.
The hypothesis is testable: a smaller, focused model with a native tokenizer can match or exceed a larger, multilingual model on Italian tasks — because every parameter is working for Italian, not diluted across dozens of languages. And once the model, the code, and the data pipeline are public, anyone can verify that claim.
Why Now, Why Me
I have a PhD from Sapienza — the same university that produced Minerva and DanteLLM. I teach AI to business, finance, and law students at LUISS. I run LEAF, where we bring emerging tech to companies. I live in the gap between research and business every single day.
Every week, in both worlds, I see the same pattern. Italian companies want to use AI for document analysis, contract review, customer interaction — in Italian. They try the available options. The corporate Italian models are locked behind enterprise agreements. The fine-tuned open models still think in English underneath. The global models treat Italian as an afterthought.
There’s a window right now where the hardware is accessible but the right models don’t exist yet. Server-grade GPUs with 140+ GB of memory make it possible to train a 2B model in weeks, not months. Two years ago, this required a cluster. Two years from now, there might be off-the-shelf options that make this unnecessary.
I’m in that window. And I know how to build.
What This Series Will Cover
I’m documenting the entire process — not the polished version where everything works on the first try, but the real version with the bugs, the wrong decisions, and what I learned from each one.
Over the next articles I’ll share real code, real error logs, real numbers. The kind of detail that the Italian AI ecosystem publishes in press releases but never in engineering blogs.
Why I’m Writing This in Public
The Italian AI ecosystem has a communication problem.
The corporate players announce products. The academics publish papers. Nobody bridges the two in a way that a CTO in Milan or a startup founder in Turin can actually use. The engineering knowledge — the messy, practical knowledge of how to actually build these things — stays hidden behind NDAs and institutional walls.
I don’t think it should.
I’m releasing Dante-2B — model weights, training code, tokenizer, data pipeline, every script — as fully open-source on GitHub and HuggingFace. Not because I’m altruistic, but because I believe the Italian AI ecosystem grows faster when the engineering is public. When a developer in Bologna can clone a repo and train their own model. When a research group can build on real, documented code instead of starting from zero. When a company can evaluate whether to fine-tune an open Italian model instead of signing an enterprise contract they can’t afford.
The model is a tool. The code is an artifact. But the real contribution is the knowledge — and knowledge only compounds when it’s shared.
If you’re running a business and wondering what Italian AI can actually do for you — not the press-release version, but the real capabilities and limitations — this series is for you.
If you’re a developer or researcher thinking about building a model for a specific language or domain — this is the playbook I wished existed.
If you’re an Italian policy maker trying to understand whether the country’s AI investments are producing real, reproducible capabilities — read on.
And if you’re one of my students at LUISS wondering what the professor does when he’s not in the classroom — now you know.
Let’s build.
Next in the series
Assembling 450 Billion Tokens: The Training Data Nobody Had Ready
I’m Fabio — with LEAF I bring emerging technologies to businesses before they go mainstream. At LUISS and LUISS Business School, I teach deep tech to the people who won’t build these technologies, but will decide whether to adopt them.
Why I’m Training an Italian Language Model from Scratch — With Two GPUs and No Funding was originally published in Towards AI on Medium.