Hey r/LocalLLaMA,
Production ML compiler stack is brutal: TVM is 500K+ lines of C++. PyTorch piles Dynamo, Inductor, and Triton on top of each other. XLA, MLIR, Halide, Mojo.
It is, arguably, the most important piece of modern compute infrastructure, and there is little information available on core concepts. To fill the gap, I built a small ML compiler from scratch: pure Python and raw CUDA, no library use. It takes a small transformer (TinyLlama, Qwen2.5-7B) and lowers it to a sequence of CUDA kernels through six IRs. On RTX 5090, the autotuned stack lands at a geomean of 0.96× vs. the PyTorch production stack, with 32 of 84 kernel shapes beating PyTorch hand-optimized kernels (max 5.6× speedup).
After a month of work, the three-part series is finally finished:
Part 1 walks an RMSNorm layer end-to-end through the upper half of the pipeline: - Torch IR — captured FX graph (rmsnorm, linear, softmax, ...) - Tensor IR — every op decomposed into Elementwise / Reduction / IndexMap - Loop IR — a kernel written as a loop nest fused with other kernels - Tile IR — a kernel scheduled onto the GPU (threads, blocks, shared memory) - Kernel IR — schedule materialized into hardware primitives - CUDA — emitted source ready for nvcc
For example, a PyTorch expression walked through IR levels: python torch.relu(torch.matmul(x + bias, w)) # x: (16, 64), bias: (64,), w: (64, 16) Torch IR: bias_bc = bias[j] -> (16, 64) float32 add = add(x, bias_bc) -> (16, 64) float32 matmul = matmul(add, w, has_bias=False) -> (16, 16) float32 relu = relu(matmul) -> (16, 16) float32 Tensor IR: bias_bc = bias[j] -> (16, 64) float32 w_bc = w[j, k] -> (16, 64, 16) float32 add = add(x, bias_bc) -> (16, 64) float32 add_bc = add[i, j] -> (16, 64, 16) float32 prod = multiply(add_bc, w_bc) -> (16, 64, 16) float32 red = sum(prod, axis=-2) -> (16, 1, 16) float32 matmul = red[i, na, j] -> (16, 16) float32 relu = relu(matmul) -> (16, 16) float32 Loop IR python === merged_relu -> relu === for a0 in 0..16: # free (M) for a1 in 0..16: # free (N) for a2 in 0..64: # reduce (K) in0 = load bias[a2] in1 = load x[a0, a2] in2 = load w[a2, a1] v0 = add(in1, in0) # prologue (inside reduce) v1 = multiply(v0, in2) acc0 <- add(acc0, v1) v2 = relu(acc0) # epilogue (outside reduce) merged_relu[a0, a1] = v2
Part 2 explains the lower half: how a loop nest becomes a GPU schedule. Sixteen mechanical Tile-IR passes to split computations into blocks, map to threads, stage inputs into smem, etc. Each pass is one diff in the CLI. It mimics the sequence of optimizations a CUDA engineer would make.
For example, a pass that stages inputs into the smem: bash deplodock compile \ -c "torch.nn.RMSNorm(2048)(torch.randn(1,32,2048))" \ --ir tile -vv \ | awk '/^>>> t:007/,/^<<< t:007/' ```diff
t:007_stage_inputs @@ matched at rms_norm (in-place) @@ @@ -2,6 +2,7 @@ v0 = reciprocal(2048) Tile(axes=(a0:256=THREAD, a1:32=BLOCK)): + x_smem = Stage(x, origin=(0, a1, 0), slab=(a2:2048@2)) StridedLoop(a2 = a0; < 2048; += 256): # reduce - in2 = load x[0, a1, a2] + in2 = load x_smem[a2] v1 = multiply(in2, in2) acc0 <- add(acc0, v1) @@ -11,5 +12,5 @@ v4 = rsqrt(v3) StridedLoop(a2 = a0; < 2048; += 256): # free - in3 = load x[0, a1, a2] + in3 = load x_smem[a2] in4 = load p_weight[a2] v5 = multiply(in3, v4) <<< t:007_stage_inputs ```
Part 3 finishes the series with autotuning. Every parameter in part 2 (block size, register tile, K-chunk, whether to stage, whether to double-buffer, etc.) was hand-picked using heuristics. Those heuristics worked on the shapes I fit them to, but fell over elsewhere. Part 3 replaces heuristics with a search loop: SP-MCTS over the cross-product of rule parameters.
The whole pipeline is one CLI: ```bash
inspect any IR stage
deplodock compile -c "nn.RMSNorm(2048)(torch.randn(1,32,2048))" --ir tensor|loop|tile|kernel|cuda
bench end-to-end
deplodock run --bench -c "torch.nn.Softmax(dim=-1)(torch.randn(1,28,2048,2048))"
autotune on the live GPU
deplodock tune -c "nn.RMSNorm(2048)(torch.randn(1,32,2048))" -v
full model
deplodock compile Qwen/Qwen2.5-7B ```
Each part is self-contained enough that you can skip ahead if you only care about one layer: - Part 1. IR Hierarchy — From PyTorch to Emitted CUDA - Part 2. Tile IR — Scheduling Loops onto a GPU - Part 3. Autotuning — A Search Loop Over Tile-IR Rewrites - Repo: https://github.com/cloudrift-ai/deplodock
[link] [comments]