Agents are now writing the GPU kernels that train the agents. The design pattern underneath it is what you should actually steal.

The most interesting ML infra release of April 2026 wasn’t a model.
It was a blog post from LinkedIn’s Liger Kernel team — buried under the standard “AI helping build better AI” framing that nobody clicks on anymore — describing three agentic workflows that now write Triton GPU kernels, integrate new model architectures, and optimize existing kernels with minimal human input.
The actual numbers, from real merged PRs:
- A ReLU² activation kernel: 1.9× forward speedup, 3.2× backward, 37.5% memory reduction. Days of expert time → human reviewing a profile.
- A backward pass for fused RMS norm: 3.35× speedup at hidden dim 16384, 59% full-pass speedup. The agent diagnosed register pressure (115 registers/thread, 12.5% occupancy) and applied four targeted fixes. Zero regressions across 40 tests.
- An internal LinkedIn training job using an agent-generated batched mean-pooling kernel: encoder step time 400ms → 40ms. Training step 1.12s → 0.39s. 64.7% of GPU hours saved end-to-end.
If you’ve spent any time in low-level GPU optimization, those numbers should make you stop scrolling. People with PhDs in this stuff don’t ship 10× speedups on a Tuesday. The agent did, and the human reviewed the diff.
But the speedups aren’t the actual story. The actual story is the design pattern that produced them — and why it generalizes far beyond GPU kernels.

Movement 1: Why GPU kernels are the perfect testbed for agents
Before the recursion gets interesting, you need to understand why GPU kernel engineering is the worst possible place to put an LLM agent — and therefore the most interesting one.
Triton kernel debugging is notoriously brutal. Wrong numbers hide behind precision differences, race conditions, and incorrect mask handling. A kernel that “looks right” can silently corrupt training for thousands of GPU-hours before anyone notices the loss curve is off. Every new model architecture has subtle differences — Gemma upcasts to fp32 where Llama uses partial casting, RMSNorm offsets vary, MoE routing breaks naive patching, RoPE has variants. Get any of these wrong, and you don’t get an error. You get a model that almost trains.
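To make “almost trains” concrete, here’s a minimal sketch of the casting-mode trap. This paraphrases the two styles rather than quoting Liger’s or HuggingFace’s actual code:
```python
import torch

def rmsnorm(x, w, eps=1e-6, full_upcast=False):
    # Both branches are "RMSNorm"; they differ only in when the result
    # leaves fp32. Swap them and nothing errors. The numbers are just off.
    xf = x.float()
    xf = xf * torch.rsqrt(xf.pow(2).mean(-1, keepdim=True) + eps)
    if full_upcast:
        return (w.float() * xf).to(x.dtype)  # Gemma-flavored: multiply in fp32
    return w * xf.to(x.dtype)                # Llama-flavored: cast back first

x = torch.randn(4, 4096, dtype=torch.bfloat16)
w = torch.randn(4096, dtype=torch.bfloat16)
diff = (rmsnorm(x, w).float() - rmsnorm(x, w, full_upcast=True).float()).abs().max()
print(diff)  # small, nonzero, and completely silent
```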
This is the textbook “agents shouldn’t touch this” domain. High blast radius, low feedback density, and expensive to verify.
Liger Kernel is also exactly the kind of project where agentic automation should crush. Every new kernel goes through analysis → implementation → testing → benchmarking. Every model integration resolves a fixed set of architectural decisions. Every optimization pass profiles, classifies, hypothesizes, validates. These are repeatable patterns with clear correctness criteria.
The bet LinkedIn made: encode the domain expertise into structured workflows once, and then let agents execute them under human review. Not “agent writes code.” The agent produces a structured profile capturing every architectural decision; a human verifies the profile; code generation is then deterministic from the profile.
This distinction is the entire post.

Movement 2: What they actually shipped
Three agent skills, each running the same three-stage pipeline (Understand → Act → Verify) with mandatory human checkpoints between stages.
liger-kernel-dev turns a PyTorch operation into a Triton kernel. You feed it a paper, a code snippet, or a description like “ReLU squared activation function.” The agent classifies the operation into one of three complexity tiers — element-wise (Tier 1), reduction (Tier 2), or fused/complex (Tier 3) — and uses existing Liger kernels of the same tier as reference implementations. It generates ~8 files: the Triton kernel, an nn.Module wrapper, a functional API, exports, parametrized unit tests, and benchmarks. The agent doesn’t invent patterns. It applies proven ones.
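For a feel of what a Tier-1 output looks like, here’s a minimal element-wise Triton kernel in the ReLU² shape. This is a sketch of the pattern, not the kernel the agent actually merged:
```python
import torch
import triton
import triton.language as tl

@triton.jit
def relu2_fwd_kernel(x_ptr, y_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Canonical Tier-1 pattern: one program per BLOCK_SIZE slice, masked tail.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask, other=0.0)
    r = tl.maximum(x, 0.0)
    tl.store(y_ptr + offsets, r * r, mask=mask)

def relu2(x: torch.Tensor) -> torch.Tensor:
    y = torch.empty_like(x)
    n = x.numel()
    relu2_fwd_kernel[(triton.cdiv(n, 1024),)](x, y, n, BLOCK_SIZE=1024)
    return y
```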
liger-autopatch adds Liger optimization support for a new HuggingFace Transformers model. The agent reads modeling_*.py and resolves a structured 12-decision matrix — norm type, casting mode, RMSNorm offset, MLP activation pattern, dense vs MoE structure, vision components, RoPE variant — before generating code. Two real PRs in the post: Nemotron and Ministral, both merged with no manual code changes after profile review.
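The decision matrix is the reviewable artifact. Its real schema lives in the Liger repo; the fields below are an illustrative guess at its shape, not the actual thing:
```python
from dataclasses import dataclass
from typing import Literal

@dataclass(frozen=True)
class ArchProfile:
    # Hypothetical field names; the real matrix resolves 12 such decisions.
    norm_type: Literal["rmsnorm", "layernorm"]
    casting_mode: Literal["llama", "gemma"]   # partial vs full fp32 upcast
    rmsnorm_offset: float                     # e.g. Gemma adds 1.0 to weights
    mlp_activation: Literal["swiglu", "geglu", "relu2"]
    structure: Literal["dense", "moe"]
    has_vision: bool
    rope_variant: Literal["default", "linear", "yarn"]
```
A reviewer can eyeball a profile like this in seconds; the equivalent modeling-code diff, they can’t.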
liger-kernel-perf optimizes existing kernels. It runs NCU profiling, classifies the bottleneck (memory-bound, compute-bound, latency-bound), and generates versioned optimization variants. Each variant gets a “lab notebook” tracking hypothesis, changes, and results. The agent reads all prior notebooks before generating the next variant — accumulated learning across iterations. Guardrails reject any variant that regresses a non-target metric by more than 5%.
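The guardrail itself is easy to picture. A minimal sketch, assuming lower-is-better metrics and the 5% budget from the post (the function and names here are mine, not Liger’s):
```python
def accept_variant(baseline: dict[str, float], candidate: dict[str, float],
                   target: str, budget: float = 0.05) -> bool:
    # Reject unless the target metric actually improves (lower is better)...
    if candidate[target] >= baseline[target]:
        return False
    # ...and no non-target metric regresses by more than the budget.
    return all(candidate[m] <= base * (1 + budget)
               for m, base in baseline.items() if m != target)
```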
The fused_add_rms_norm result is the cleanest demonstration in the entire post. NCU profiling on H100 revealed the backward kernel was severely underutilizing the GPU. The agent diagnosed register pressure as the root cause — 8 BLOCK_SIZE-wide vectors live simultaneously at peak — and applied four targeted optimizations: reordering dW before dX for register reuse, factoring the dX formula with a precomputed scalar, deferring a load until freed registers were available, and adding num_stages=2 for Hopper software pipelining.
That’s not “agent autocompletes some code.” That’s a junior infra engineer’s first month of NCU profiling work, compressed into one review cycle.
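If you want the second fix in math rather than prose: the dX formula factors around one precomputed row scalar. Here’s a PyTorch-level paraphrase that mirrors the agent’s Triton-level change in spirit only:
```python
import torch

def rmsnorm_backward_dx(dy, w, x, eps=1e-6):
    # dx = rstd * (dy*w - x_hat * mean(dy*w*x_hat))
    rstd = torch.rsqrt(x.float().pow(2).mean(-1, keepdim=True) + eps)
    x_hat = x.float() * rstd
    g = dy.float() * w.float()
    c = (g * x_hat).mean(-1, keepdim=True)  # the precomputed scalar, one per row
    # With c in hand, the wide intermediate g * x_hat is dead before dx is
    # formed: exactly the register-pressure relief the agent was chasing.
    return ((g - x_hat * c) * rstd).to(x.dtype)
```
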
Movement 3: The design pattern that actually matters
Here’s where most of the takes will stop. Speedups, agentic workflows, cool. Move on.
Don’t.
Read the design principles section of LinkedIn’s post carefully. Five principles, but they collapse into one insight that generalizes far beyond GPU kernels:
The profile is the product.
Not the generated code. The structured intermediate representation that captures every architectural decision before code generation begins.
For kernel-dev, it’s a tier classification + tiling strategy. For autopatch, it’s the 12-decision architectural matrix. For kernel-perf, it’s the bottleneck classification + register pressure analysis. In every case, the agent’s real job is reasoning, not writing. Code generation becomes deterministic once the profile is right.
This inverts the usual agent design failure mode. Most teams point an LLM at a task, watch it produce plausible-looking output, and then bolt validation on at the end. Liger’s workflows force the agent to commit to its decisions in a reviewable artifact first, and then generate code from that artifact. The human’s job is to verify reasoning, not to read code diffs.
The other four principles are scaffolding around this:
Tier-based pattern matching prevents invention. Existing solutions of the same class are the reference. The agent’s creativity is constrained to the dimensions where creativity matters.
Verifiable checkpoints make review tractable. You’re reviewing one structured document per stage, not unbounded code.
Validation as a first-class stage classifies failures into hard gates (import errors, test failures) and soft gates (tolerance tuning). Three failures → stop and report rather than generate increasingly wrong code. Most agentic systems lack this and degrade silently. (A sketch of the gate logic follows this list.)
Template-driven consistency means the generated code looks like a contributor wrote it. This isn’t aesthetic. It’s what makes review fast and what makes the code maintainable after the agent is done.
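What that gate logic might look like, reduced to a skeleton (all names hypothetical, not Liger’s actual API):
```python
# Hard gates stop the loop immediately; soft gates get a bounded retry budget.
HARD = {"import_error", "test_failure"}
SOFT = {"tolerance_mismatch"}

def next_action(failures: list[str], soft_strikes: int, max_strikes: int = 3) -> str:
    if any(f in HARD for f in failures):
        return "stop_and_report"   # never generate around a hard failure
    if soft_strikes >= max_strikes:
        return "stop_and_report"   # three strikes: degrade loudly, not silently
    return "retry" if failures else "proceed"
```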
If you’re building any agentic system that touches production code — internal tools, infra automation, code review bots, anything — these five principles are the spec. They’re not Liger-specific. They’re what separates “agent helps me write code” from “I verify the agent’s engineering decisions.”

Why this is the real “AI building AI” milestone
Twitter has been calling everything “AI building AI” for two years. Most of it is autocomplete with extra steps.
This is different.
The fused_add_rms_norm kernel those agents wrote will be used to train the next generation of Liger contributors’ models. Some of those models will be agents. Those agents will write the next round of kernels. The recursion is real, and it’s running in production at LinkedIn right now, optimizing internal recommendation models with 64.7% GPU-hour savings.
The bottleneck for open-source ML infrastructure was never ideas. It was expert time to implement them correctly. Liger Kernel had a list of optimizations the team knew were possible but couldn’t ship fast enough. The community had a list of model architectures users wanted supported but couldn’t get to. The maintainers were doing what every successful open-source project does — running out of time.
LinkedIn just shipped a way to convert “ideas the team knows about” into “merged PRs” without scaling the team. When Meta releases the next Llama, a community member runs liger-autopatch and submits a PR within hours. When NVIDIA ships Blackwell-next, anyone runs liger-kernel-perf to re-optimize without three years of CUDA experience.
That’s not a feature release. That’s a phase change in how open-source infra projects scale.
And here’s the part nobody is saying out loud: this generalizes. Every mature open-source project — PyTorch, vLLM, llama.cpp, Triton itself — has the same shape. Repeatable engineering tasks. Clear correctness criteria. Expert bottleneck. The Liger team just published the playbook. The first project to copy it on a different codebase wins.

What I’d actually do if I were building infra right now
Three concrete takeaways:
If you’re building agentic systems for any structured engineering domain, steal the profile-as-IR pattern wholesale. Don’t generate code directly from natural language input. Force a structured intermediate representation that captures every decision. Make the human review the IR, not the code. The code generation becomes a templated transform.
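A skeleton of that transform, with every name hypothetical:
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Profile:
    op_name: str
    tier: int      # 1 = element-wise, 2 = reduction, 3 = fused/complex
    formula: str

def agent_reason(task: str) -> Profile:
    # Stand-in for the LLM call: the only non-deterministic step in the pipeline.
    return Profile(op_name="relu2", tier=1, formula="torch.clamp_min(x, 0) ** 2")

def human_review(p: Profile) -> bool:
    print(f"approve? {p}")   # in reality: a checkpoint artifact in the PR
    return True

def render_template(p: Profile) -> str:
    # Deterministic: same profile in, same code out, every time.
    return f"def {p.op_name}(x):\n    return {p.formula}\n"

def run_workflow(task: str) -> str:
    profile = agent_reason(task)        # Understand
    if not human_review(profile):       # checkpoint: review the IR, not a diff
        raise RuntimeError("profile rejected before any code existed")
    return render_template(profile)     # Act (Verify would follow)
```
The LLM call is quarantined in one function; everything downstream is testable, deterministic plumbing.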
If you’re maintaining an open-source library, publish your own agent skills. Liger ships the workflows in the repo. Anyone with a coding agent can run them. Your contributor pipeline just got infinite scale, and your maintainer burden just dropped.
If you’re a developer who’s been intimidated by GPU kernel engineering, the moat just shrank. You can now contribute optimized Triton kernels to a major open-source project by understanding the math, not the hardware. That’s not a small change.
The bottleneck for open-source ML infra was never ideas. It was expert time. That bottleneck is gone.
If you found this useful, the full Liger Kernel repo is at github.com/linkedin/Liger-Kernel, and the original LinkedIn engineering post went up on April 21, 2026. Worth reading the actual design principles section — it’s the real meat.