Repo: https://codeberg.org/JohannaJuntos/Sisyphus
I've been building a small Rust-focused language model from scratch in PyTorch. Not a finetune — byte-level, trained from random init on a Rust-heavy corpus assembled in this repo.
The run:
- 25.6M parameters
- 512 context length
- 173.5M-byte corpus
- 30k training steps
- Single RTX 4060 Ti 8GB
- Final train loss: 0.5834 / val loss: 0.8217 / perplexity: 2.15
- Inference: 286.6 tok/s with HybridAttention + KV cache — 51.47x vs full attention
Background
I'm an autistic systems programmer, writing code since 2008/2009, starting in C. I approach ML like a systems project: understand the data path, understand the memory behavior, keep the stack small, and add complexity only when justified. That's basically the shape of this repo.
Architecture
Byte-level GPT-style decoder:
- Vocab size 256 (bytes)
- 8 layers, 8 heads, 512 embedding dim
- Learned positional embeddings
- Tied embedding / LM head weights
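Byte-level means there's no tokenizer to train at all: every byte value is its own token id, so the vocab is exactly 256 and encode/decode is a lossless round-trip. A minimal sketch:

```python
# Byte-level "tokenization": token ids are just the UTF-8 bytes of the text.
text = 'fn main() { println!("hi"); }'

ids = list(text.encode("utf-8"))          # each byte -> a token id in [0, 256)
assert all(0 <= i < 256 for i in ids)

decoded = bytes(ids).decode("utf-8")      # decoding is exact, no merges or OOV
assert decoded == text
```

The trade-off is sequence length: one token per byte, which is part of why the byte-vs-BPE question at the end is worth testing.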
The attention block is not standard full attention. Each layer uses HybridAttention, combining:
- Local windowed causal attention
- A GRU-like recurrent state path
- A learned gate mixing the two
The local path handles short-range syntax; the recurrent path carries compressed long-range state without paying the quadratic cost. The gate bias is initialized to ones, so early training starts local-biased.
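A rough sketch of the idea, not the repo's actual code: module choices (`nn.MultiheadAttention`, `nn.GRU`) and the gating shape here are illustrative stand-ins for the real Triton-backed implementation.

```python
import torch
import torch.nn as nn

class HybridAttentionSketch(nn.Module):
    """Illustrative only: local windowed causal attention mixed with a
    GRU-style recurrent path through a learned sigmoid gate."""

    def __init__(self, d_model=512, n_heads=8, window=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)
        self.gate = nn.Linear(d_model, d_model)
        nn.init.ones_(self.gate.bias)   # sigmoid(1) ~ 0.73: start local-biased
        self.window = window

    def forward(self, x):
        T = x.size(1)
        i = torch.arange(T, device=x.device)
        # allowed iff causal AND within the last `window` positions
        keep = (i[None, :] <= i[:, None]) & (i[:, None] - i[None, :] < self.window)
        local, _ = self.attn(x, x, x, attn_mask=~keep)  # True = masked out
        recurrent, _ = self.rnn(x)                      # compressed long-range state
        g = torch.sigmoid(self.gate(x))
        return g * local + (1 - g) * recurrent
```

The gate is per-position and per-channel here; the real kernel may gate differently, but the local-biased init follows directly from `sigmoid(1) ≈ 0.73`.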
The inference path uses Triton-optimized kernels and torch.library custom ops for the local window attention.
Corpus
This is probably the most important part of the repo.
The corpus starts with official Rust docs, the compiler/library/test sources, cargo, rust-analyzer, tokio, serde, ripgrep, clap, and axum, roughly 31MB. It was then expanded to 177,151,242 bytes by fetching the top 500 crates (461 cloned successfully).
Expanding the corpus from 31M to 173.5M bytes helped more than anything else in the repo.
Training
AdamW, lr 2e-4, weight decay 0.1, betas (0.9, 0.95), 30k steps, 1k warmup. ~678.8 MiB training memory on a 7.6 GiB card.
All experimental memory tricks (gradient quantization, activation compression, selective backprop, gradient paging) were disabled. Small custom architecture + mixed precision + better corpus was enough.
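The stated hyperparameters translate to something like the sketch below. The warmup is linear here and the decay shape after warmup is my assumption (cosine); the post doesn't specify it.

```python
import math
import torch

# Stand-in model; the real one is the 25.6M-param byte-level decoder.
model = torch.nn.Linear(512, 256)

opt = torch.optim.AdamW(
    model.parameters(),
    lr=2e-4,
    weight_decay=0.1,
    betas=(0.9, 0.95),
)

def lr_scale(step, warmup=1_000, total=30_000):
    """Linear warmup, then (assumed) cosine decay to zero."""
    if step < warmup:
        return step / warmup
    progress = (step - warmup) / (total - warmup)
    return 0.5 * (1 + math.cos(math.pi * progress))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_scale)
```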
Loss curve:
- Step 0: train 5.5555 / val 5.5897
- Step 1000: train 2.4295 / val 2.6365
- Step 5000: train 0.9051 / val 1.0060
- Step 10000: train 0.8065 / val 0.8723
- Step 18500: train 0.6902 / val 0.7757
- Step 29999: train 0.5834 / val 0.8217
Best val loss lands around step 18.5k; the run is overfitting or plateauing late.
Inference performance
- Full attention O(n²): 17.96s / 5.6 tok/s
- HybridAttention O(n·W + n·D): 0.35s / 286.6 tok/s
- Speedup: 51.47x — no quality loss
KV cache strategy: hot window of W=64 tokens in VRAM (~256KB), older tokens compressed to 8-bit magnitude + angle, selective promotion on demand. Complexity drops from O(n²·d) to O(n·W·d_head), which works out to roughly O(4096·n) for this model.
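A toy sketch of what 8-bit magnitude + angle compression could look like for cold cache entries: consecutive dims are treated as complex pairs, and both the magnitude and the angle of each pair are quantized to uint8. This is illustrative NumPy, not the repo's kernel; the real pairing and scaling scheme may differ.

```python
import numpy as np

def compress_kv(v):
    """Quantize a float KV tensor (last dim even) to 8-bit magnitude + angle."""
    z = v[..., 0::2] + 1j * v[..., 1::2]          # pair dims into complex values
    mag, ang = np.abs(z), np.angle(z)
    scale = mag.max() + 1e-8                       # single scale for magnitudes
    q_mag = np.round(mag / scale * 255).astype(np.uint8)
    q_ang = np.round((ang + np.pi) / (2 * np.pi) * 255).astype(np.uint8)
    return q_mag, q_ang, scale

def decompress_kv(q_mag, q_ang, scale):
    """Reconstruct the float tensor from quantized magnitude + angle."""
    mag = q_mag.astype(np.float32) / 255 * scale
    ang = q_ang.astype(np.float32) / 255 * 2 * np.pi - np.pi
    z = mag * np.exp(1j * ang)
    out = np.empty(q_mag.shape[:-1] + (q_mag.shape[-1] * 2,), np.float32)
    out[..., 0::2], out[..., 1::2] = z.real, z.imag
    return out
```

Storage per pair is 2 bytes instead of 8 (fp32) or 4 (fp16), i.e. 4x/2x smaller, at the cost of a small reconstruction error.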
All five tests pass: forward pass, generation with and without cache, RNN state isolation, and window mechanics.
Generation quality
Surface-level Rust syntax looks decent and imports and signatures can look plausible, but semantics are weak, and repetition and recursive nonsense are still common. That's an honest read of the current state.
What I think is actually interesting
Four distinct experiments, each shipped working code:
- Byte-level Rust-only pretraining
- Hybrid local-attention + recurrent block replacing standard full attention
- Corpus expansion from core repos to broader crate ecosystem
- Production-ready hot/cold KV cache paging — 51.47x speedup, no quality loss
The clearest win is corpus expansion. The second-order win is that HybridAttention + cache is fast enough for real interactive use on consumer hardware.
What's next
- Ablation — HybridAttention vs local-only vs RNN-only
- Checkpoint selection — does step 18.5k generate better than 29999?
- Syntax validation — does the output parse/compile/typecheck?
- Context length sweep — 256 to 2048, where does window size hurt?
- Byte vs BPE — now that corpus is 5.6x larger, worth testing?
Questions for the sub:
- For small code models, what evals have actually been useful beyond perplexity?
- Has anyone seen hybrid local + recurrent attention work well for code gen, or does it usually lose to just scaling a plain transformer?
- If you had this setup — more tokens, longer context, or cleaner ablation first?