mii-llm just released a detailed technical report on the development of the Zagreus and Nesso model families: 0.4B-parameter language models trained from scratch with a focus on edge deployment, multilingual capability, and European languages. The report documents the full pipeline behind a family of small language models designed for Italian, Spanish, French, and Portuguese, with bilingual pretraining centered on English plus the target language.

**Released models**
**Training setup**

According to the report, the project used:
The report also explains why a dense 0.4B architecture was chosen over a mixture-of-experts (MoE) design, arguing that in the sub-1B regime, training stability and parameter utilization can matter more than sparse-compute efficiency.

**Why this is interesting**

Much of the current discussion focuses on frontier-scale models, but this report is a useful example of the opposite direction: small models trained from scratch for practical multilingual edge scenarios. Some points that stand out:
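As a back-of-the-envelope illustration of the dense-vs-MoE point: in a dense model every parameter is active on every token, so the whole budget is "utilized". A minimal sketch of how a ~0.4B dense transformer budget breaks down (the hidden size, layer count, vocabulary size, and FFN multiplier below are illustrative assumptions, not the report's actual architecture):

```python
def dense_param_count(d_model: int, n_layers: int, vocab_size: int, ffn_mult: int = 4) -> int:
    """Rough parameter count for a dense transformer (biases/norms ignored)."""
    emb = vocab_size * d_model                 # token embeddings (assumed tied with the output head)
    attn = 4 * d_model * d_model               # Q, K, V, O projections per layer
    ffn = 2 * ffn_mult * d_model * d_model     # up + down FFN projections per layer
    return emb + n_layers * (attn + ffn)

# Hypothetical config: d_model=1024, 24 layers, 50k vocab
print(dense_param_count(1024, 24, 50_000))  # → 353189888, i.e. ~0.35B
```

With a sparse MoE at this scale, only a fraction of those FFN parameters would be active per token, which is the utilization trade-off the report argues against in the sub-1B regime.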
**Benchmark notes**

The report includes comparisons against Qwen3-0.6B and Qwen3.5-0.8B, along with multilingual evaluations and task-by-task analysis. A few interesting takeaways:
**Figures**

(Figures in the report: LLM-as-judge comparison; classical benchmarks; Italian benchmark results; English benchmark results.)

**Main takeaway**

This is a solid case study of what it actually takes to train a small multilingual LLM from scratch in 2026: tokenization, storage, Slurm orchestration, distributed training, post-training, evaluation, and model release. For anyone interested in small language models, multilingual training, edge deployment, or open LLM engineering, the report is worth a read.
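For readers unfamiliar with the orchestration step mentioned above, here is a minimal sketch of what a Slurm launch for multi-node distributed training typically looks like. The node counts, GPU counts, `train.py` entry point, and config path are placeholders, not the report's actual setup:

```shell
#!/bin/bash
# Hypothetical Slurm job script: one torchrun launcher task per node,
# rendezvous on the first allocated node. All values are illustrative.
#SBATCH --job-name=smol-pretrain
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8

# First hostname in the allocation acts as the rendezvous endpoint
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc-per-node=8 \
  --rdzv-backend=c10d \
  --rdzv-endpoint="${MASTER_ADDR}:29500" \
  train.py --config pretrain.yaml
```

The report presumably wraps more around this (checkpointing, restarts, data staging), but this is the core shape of the Slurm + distributed-training piece of the pipeline.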