Repo: https://codeberg.org/JohannaJuntos/Sisyphus
I've been building a small Rust-focused language model from scratch in PyTorch. Not a finetune — byte-level, trained from random init on a Rust-heavy corpus assembled in this repo.
The run:
- 25.6M parameters
- 512 context length
- 173.5M-byte corpus
- 30k training steps
- Single RTX 4060 Ti 8GB
- Final train loss: 0.5834 / val loss: 0.8217 / perplexity: 2.15
- Inference: 286.6 tok/s with HybridAttention + KV cache — 51.47x vs full attention
Background
I'm an autistic systems programmer, writing code since 2008/2009, starting in C. I approach ML like a systems project: understand the data path, understand the memory behavior, keep the stack small, and add complexity only when justified. That's basically the shape of this repo.
Architecture
Byte-level GPT-style decoder:
- Vocab size 256 (bytes)
- 8 layers, 8 heads, 512 embedding dim
- Learned positional embeddings
- Tied embedding / LM head weights
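Byte-level means there's no tokenizer to train at all: every byte value is its own token id, so the vocab is exactly 256 and encode/decode is a lossless round-trip. A minimal sketch:

```python
# Byte-level "tokenization": token ids are just the UTF-8 bytes of the text.
text = 'fn main() { println!("hi"); }'

ids = list(text.encode("utf-8"))          # each byte -> a token id in [0, 256)
assert all(0 <= i < 256 for i in ids)

decoded = bytes(ids).decode("utf-8")      # decoding is exact, no merges or OOV
assert decoded == text
```

The trade-off is sequence length: one token per byte, which is part of why the byte-vs-BPE question at the end is worth testing.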
The attention block is not standard full attention. Each layer uses HybridAttention, combining:
- Local windowed causal attention
- A GRU-like recurrent state path
- A learned gate mixing the two
The local path handles short-range syntax; the recurrent path carries compressed long-range state without paying the quadratic cost. The gate bias is initialized to ones, so early training starts local-biased.
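A rough sketch of the idea, not the repo's actual code: module choices (`nn.MultiheadAttention`, `nn.GRU`) and the gating shape here are illustrative stand-ins for the real Triton-backed implementation.

```python
import torch
import torch.nn as nn

class HybridAttentionSketch(nn.Module):
    """Illustrative only: local windowed causal attention mixed with a
    GRU-style recurrent path through a learned sigmoid gate."""

    def __init__(self, d_model=512, n_heads=8, window=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)
        self.gate = nn.Linear(d_model, d_model)
        nn.init.ones_(self.gate.bias)   # sigmoid(1) ~ 0.73: start local-biased
        self.window = window

    def forward(self, x):
        T = x.size(1)
        i = torch.arange(T, device=x.device)
        # allowed iff causal AND within the last `window` positions
        keep = (i[None, :] <= i[:, None]) & (i[:, None] - i[None, :] < self.window)
        local, _ = self.attn(x, x, x, attn_mask=~keep)  # True = masked out
        recurrent, _ = self.rnn(x)                      # compressed long-range state
        g = torch.sigmoid(self.gate(x))
        return g * local + (1 - g) * recurrent
```

The gate is per-position and per-channel here; the real kernel may gate differently, but the local-biased init follows directly from `sigmoid(1) ≈ 0.73`.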
The inference path uses Triton-optimized kernels and torch.library custom ops for the local window attention.
Corpus
This is probably the most important part of the repo.
The corpus starts with official Rust docs, the compiler/library/test sources, cargo, rust-analyzer, tokio, serde, ripgrep, clap, and axum, roughly 31MB. It was then expanded to 177,151,242 bytes by fetching the top 500 crates (461 cloned successfully).
Expanding the corpus from 31M to 173.5M bytes helped more than anything else in the repo.
Training
AdamW, lr 2e-4, weight decay 0.1, betas (0.9, 0.95), 30k steps, 1k warmup. ~678.8 MiB training memory on a 7.6 GiB card.
All experimental memory tricks (gradient quantization, activation compression, selective backprop, gradient paging) were disabled. Small custom architecture + mixed precision + better corpus was enough.
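The stated hyperparameters translate to something like the sketch below. The warmup is linear here and the decay shape after warmup is my assumption (cosine); the post doesn't specify it.

```python
import math
import torch

# Stand-in model; the real one is the 25.6M-param byte-level decoder.
model = torch.nn.Linear(512, 256)

opt = torch.optim.AdamW(
    model.parameters(),
    lr=2e-4,
    weight_decay=0.1,
    betas=(0.9, 0.95),
)

def lr_scale(step, warmup=1_000, total=30_000):
    """Linear warmup, then (assumed) cosine decay to zero."""
    if step < warmup:
        return step / warmup
    progress = (step - warmup) / (total - warmup)
    return 0.5 * (1 + math.cos(math.pi * progress))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_scale)
```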
Loss curve:
- Step 0: train 5.5555 / val 5.5897
- Step 1000: train 2.4295 / val 2.6365
- Step 5000: train 0.9051 / val 1.0060
- Step 10000: train 0.8065 / val 0.8723
- Step 18500: train 0.6902 / val 0.7757
- Step 29999: train 0.5834 / val 0.8217
Best val loss lands around step 18.5k; the run is overfitting or plateauing late.
Inference performance
- Full attention O(n²): 17.96s / 5.6 tok/s
- HybridAttention O(n·W + n·D): 0.35s / 286.6 tok/s
- Speedup: 51.47x — no quality loss
KV cache strategy: hot window of W=64 tokens in VRAM (~256KB), older tokens compressed to 8-bit magnitude + angle, selective promotion on demand. Complexity drops from O(n²·d) to O(n·W·d_head), which works out to roughly O(4096·n) for this model.
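A toy sketch of what 8-bit magnitude + angle compression could look like for cold cache entries: consecutive dims are treated as complex pairs, and both the magnitude and the angle of each pair are quantized to uint8. This is illustrative NumPy, not the repo's kernel; the real pairing and scaling scheme may differ.

```python
import numpy as np

def compress_kv(v):
    """Quantize a float KV tensor (last dim even) to 8-bit magnitude + angle."""
    z = v[..., 0::2] + 1j * v[..., 1::2]          # pair dims into complex values
    mag, ang = np.abs(z), np.angle(z)
    scale = mag.max() + 1e-8                       # single scale for magnitudes
    q_mag = np.round(mag / scale * 255).astype(np.uint8)
    q_ang = np.round((ang + np.pi) / (2 * np.pi) * 255).astype(np.uint8)
    return q_mag, q_ang, scale

def decompress_kv(q_mag, q_ang, scale):
    """Reconstruct the float tensor from quantized magnitude + angle."""
    mag = q_mag.astype(np.float32) / 255 * scale
    ang = q_ang.astype(np.float32) / 255 * 2 * np.pi - np.pi
    z = mag * np.exp(1j * ang)
    out = np.empty(q_mag.shape[:-1] + (q_mag.shape[-1] * 2,), np.float32)
    out[..., 0::2], out[..., 1::2] = z.real, z.imag
    return out
```

Storage per pair is 2 bytes instead of 8 (fp32) or 4 (fp16), i.e. 4x/2x smaller, at the cost of a small reconstruction error.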
All five tests pass: forward pass, generation with and without cache, RNN state isolation, and window mechanics.
Generation quality
Surface-level Rust syntax looks decent and imports and signatures can look plausible, but semantics are weak, and repetition and recursive nonsense are still common. That's an honest read of the current state.
What I think is actually interesting
Four distinct experiments, each shipped working code:
- Byte-level Rust-only pretraining
- Hybrid local-attention + recurrent block replacing standard full attention
- Corpus expansion from core repos to broader crate ecosystem
- Production-ready hot/cold KV cache paging — 51.47x speedup, no quality loss
The clearest win is corpus expansion. The second-order win is that HybridAttention + cache is fast enough for real interactive use on consumer hardware.
What's next
- Ablation — HybridAttention vs local-only vs RNN-only
- Checkpoint selection — does step 18.5k generate better than 29999?
- Syntax validation — does the output parse/compile/typecheck?
- Context length sweep — 256 to 2048, where does window size hurt?
- Byte vs BPE — now that corpus is 5.6x larger, worth testing?
Questions for the sub:
- For small code models, what evals have actually been useful beyond perplexity?
- Has anyone seen hybrid local + recurrent attention work well for code gen, or does it usually lose to just scaling a plain transformer?
- If you had this setup — more tokens, longer context, or cleaner ablation first?