235M param LLM from scratch on a single RTX 5080

Hey everyone,

Been working on this for a while and figured I'd share it here too. I built a small transformer language model entirely from scratch in PyTorch: no pretrained weights, no HuggingFace downloads. Every parameter was trained from raw text on a single consumer GPU.

Current release is Plasma 1.0 (235M params, 18 layers, hidden size 1024). LLaMA-style architecture: GQA with 16 query heads and 4 KV heads (head_dim 64), SwiGLU FFN with a 2816 intermediate dim, RoPE with theta 10000, pre-norm RMSNorm, and tied embeddings. 32k SentencePiece BPE vocab. Trained in bf16 mixed precision with gradient checkpointing to fit on the 5080, on ~5B tokens at sequence length 1024.
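The stated config does add up to the 235M headline. A back-of-the-envelope check, assuming no biases and counting the tied embedding matrix once (a sketch from the numbers in the post, not code from the repo):

```python
# Approximate parameter count for the Plasma 1.0 config described above.
# Assumes no attention/FFN biases and tied input/output embeddings.

vocab, hidden, layers = 32_000, 1024, 18
q_heads, kv_heads, head_dim = 16, 4, 64
ffn_inner = 2816

embeddings = vocab * hidden  # tied, so counted once

attn = (
    hidden * q_heads * head_dim         # W_q
    + 2 * hidden * kv_heads * head_dim  # W_k, W_v (GQA: fewer KV heads)
    + q_heads * head_dim * hidden       # W_o
)
ffn = 3 * hidden * ffn_inner  # SwiGLU: gate, up, and down projections
norms = 2 * hidden            # two RMSNorm weight vectors per block

total = embeddings + layers * (attn + ffn + norms) + hidden  # + final norm
print(f"{total / 1e6:.1f}M parameters")  # ≈ 235.7M
```

Note how GQA pays off at this scale: the K/V projections are 4x smaller than the query projection, so attention is only about a quarter of each block's parameters, with SwiGLU taking the rest.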

I also wrote the full pipeline myself:

  • Data from FineWeb-Edu, Wikipedia, StackExchange, code, and ArXiv
  • Quality and toxicity filtering
  • MinHash deduplication
  • Custom SentencePiece tokenizer
  • Domain-weighted data mixing
  • Pretraining and instruction tuning, with loss masking so the model only learns from assistant tokens
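The assistant-only loss masking in the last bullet is usually done by setting prompt-token labels to an ignore index, so cross-entropy skips them (PyTorch's `CrossEntropyLoss` ignores -100 by default). A minimal sketch, where the `mask_labels` helper and the token IDs are illustrative, not from the actual repo:

```python
# Sketch of loss masking for instruction tuning: prompt tokens get label
# -100 (PyTorch's default ignore_index), so only assistant tokens
# contribute to the loss. Helper name and token IDs are hypothetical.

IGNORE_INDEX = -100

def mask_labels(prompt_ids, response_ids):
    """Concatenate a turn and build labels that skip the prompt."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Example: user turn is masked out, assistant turn is supervised.
prompt = [101, 42, 7, 9]      # e.g. "<user> When was World War 1?"
response = [55, 13, 99, 102]  # e.g. "World War I began ... <eos>"
ids, labels = mask_labels(prompt, response)
print(labels)  # [-100, -100, -100, -100, 55, 13, 99, 102]
```

For multi-turn data the same idea applies per turn: every user span gets -100 labels and every assistant span keeps its token IDs.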

Some sample outputs after instruct tuning:

You: When was World War 1?
1386.ai: World War I began on June 26, 1914.

You: What is a steak made of?
1386.ai: A steak can be made from various types of meat, including beef.

It's obviously not competing with Llama 3. There are hallucinations, odd outputs, and a pretty hard ceiling at this scale. But doing it this way taught me way more than just fine-tuning a larger model would have.

Plasma 1.1 is currently training (500M params), aiming for better multi-turn and a larger vocab with byte fallback.

Repo: github.com/eb1386/1386.ai

Happy to answer any questions about the pipeline or architecture choices.

submitted by /u/ExcellentTip9926
