[P] Fused MoE Dispatch in Pure Triton: Beating CUDA-Optimized MegaBlocks at Inference Batch Sizes
I built a fused MoE dispatch kernel in pure Triton that handles the full Mixture-of-Experts forward pass. No CUDA, no vendor-specific code. On Mixtral-8x7B (A100), it beats Stanford's MegaBlocks at inference-relevant batch sizes (131% at…
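
To give a flavor of the approach, here is a minimal sketch of the core pattern: gathering routed tokens and running the expert GEMM in a single Triton kernel. This is illustrative, not the actual kernel from the post; the names (`moe_gather_gemm_kernel`, `fused_moe_matmul`), the block sizes, and the assumption that the host has sorted routed tokens by expert and padded each expert's segment to a multiple of `BLOCK_M` are all mine.

```python
# Minimal sketch of fused MoE dispatch in Triton (illustrative, not the
# kernel from this post). Assumes routed tokens are sorted by expert and
# each expert's segment is padded to a multiple of BLOCK_M, so every tile
# touches exactly one expert's weights.
import torch
import triton
import triton.language as tl


@triton.jit
def moe_gather_gemm_kernel(
    x_ptr, w_ptr, out_ptr, token_idx_ptr, expert_idx_ptr,
    N_ROUTED, D_IN, D_OUT,
    stride_xm, stride_xk,
    stride_we, stride_wk, stride_wn,
    stride_om, stride_on,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # One program computes a BLOCK_M x BLOCK_N tile of the routed output.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    mask_m = offs_m < N_ROUTED
    # Gather step: which source token feeds each output row.
    tok = tl.load(token_idx_ptr + offs_m, mask=mask_m, other=0)
    # All rows in this block share one expert (guaranteed by the host-side
    # sort-and-pad), so a single weight matrix is streamed per tile.
    e = tl.load(expert_idx_ptr + pid_m * BLOCK_M)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k0 in range(0, D_IN, BLOCK_K):
        offs_k = k0 + tl.arange(0, BLOCK_K)
        # Load gathered activations directly via the token indices: the
        # permuted activation matrix is never materialized in memory.
        a = tl.load(
            x_ptr + tok[:, None] * stride_xm + offs_k[None, :] * stride_xk,
            mask=mask_m[:, None] & (offs_k[None, :] < D_IN), other=0.0)
        b = tl.load(
            w_ptr + e * stride_we + offs_k[:, None] * stride_wk
                  + offs_n[None, :] * stride_wn,
            mask=(offs_k[:, None] < D_IN) & (offs_n[None, :] < D_OUT),
            other=0.0)
        acc = tl.dot(a, b, acc)
    tl.store(
        out_ptr + offs_m[:, None] * stride_om + offs_n[None, :] * stride_on,
        acc.to(tl.float16),
        mask=mask_m[:, None] & (offs_n[None, :] < D_OUT))


def fused_moe_matmul(x, w, token_idx, expert_idx):
    # x: (num_tokens, D_IN) fp16; w: (num_experts, D_IN, D_OUT) fp16.
    # token_idx / expert_idx: (n_routed,) int32, sorted by expert and padded
    # so every BLOCK_M-row block maps to a single expert.
    n_routed = token_idx.shape[0]
    d_in, d_out = w.shape[1], w.shape[2]
    out = torch.empty((n_routed, d_out), device=x.device, dtype=torch.float16)
    BLOCK_M, BLOCK_N, BLOCK_K = 32, 64, 32
    grid = (triton.cdiv(n_routed, BLOCK_M), triton.cdiv(d_out, BLOCK_N))
    moe_gather_gemm_kernel[grid](
        x, w, out, token_idx, expert_idx,
        n_routed, d_in, d_out,
        x.stride(0), x.stride(1),
        w.stride(0), w.stride(1), w.stride(2),
        out.stride(0), out.stride(1),
        BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N, BLOCK_K=BLOCK_K,
    )
    return out


if __name__ == "__main__":
    # Contrived routing for a quick check: expert e gets a contiguous,
    # block-aligned slab of T // E tokens (here exactly BLOCK_M = 32).
    E, T, D_IN, D_OUT = 4, 128, 128, 256
    x = torch.randn(T, D_IN, device="cuda", dtype=torch.float16)
    w = torch.randn(E, D_IN, D_OUT, device="cuda", dtype=torch.float16) * 0.02
    token_idx = torch.arange(T, device="cuda", dtype=torch.int32)
    expert_idx = torch.repeat_interleave(
        torch.arange(E, device="cuda"), T // E).to(torch.int32)
    out = fused_moe_matmul(x, w, token_idx, expert_idx)
    ref = torch.cat([x[e * (T // E):(e + 1) * (T // E)] @ w[e] for e in range(E)])
    print(torch.allclose(out, ref, atol=1e-1, rtol=1e-2))
```

In a real pipeline the sorted, padded index arrays would come from the router's top-k output rather than the contrived routing above. The point of the fusion is that the gather and the per-expert GEMM happen in one kernel launch, so no permuted copy of the activations is ever written out, which is exactly the overhead that hurts at small inference batch sizes.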