https://github.com/deepseek-ai/DeepGEMM/pull/304
https://github.com/deepseek-ai/DeepGEMM/commit/a050d09461e86eb6bba35a8c74fc0e296e8e16c7#diff-59e30829961e1b429bc12115673562f6f15d2ed347cac8d27a879bf101e977cb
New features

- Mega MoE: fuses and overlaps dispatch / linear 1 / SwiGLU / linear 2 / combine into a single mega-kernel, overlapping NVLink communication with tensor core computation
  - Performance numbers will be posted later
  - Only FP8 x FP4 MoE is supported
  - Only EP <= 8 is tested
  - Requires PyTorch >= 2.9
- FP4 indexer (MQA logits) with larger MTP support
- FP8 x FP4 GEMM
- PDL (Programmatic Dependent Launch)
- Refactored GEMM heuristics
- Faster JIT compilation
- GEMM optimizations (swapped A/B, much faster MoE GEMM)
- DeepEP v2 MoE GEMM layout
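For readers unfamiliar with the pipeline Mega MoE fuses: between dispatch and combine, each expert runs linear 1, a SwiGLU activation, then linear 2. The sketch below shows only that reference math in plain numpy; it is not DeepGEMM's API, and all function and weight names here are illustrative (the real kernel operates on quantized FP8/FP4 tensors).

```python
import numpy as np

def swiglu(gate, up):
    # SwiGLU: SiLU(gate) * up, where SiLU(x) = x * sigmoid(x)
    return (gate / (1.0 + np.exp(-gate))) * up

def expert_forward(x, w1_gate, w1_up, w2):
    # linear 1 produces the gate and up projections, SwiGLU merges them,
    # linear 2 projects back down to the model dimension
    return swiglu(x @ w1_gate, x @ w1_up) @ w2

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))         # 4 routed tokens, hidden size 8
w1_gate = rng.standard_normal((8, 16))  # gate projection
w1_up = rng.standard_normal((8, 16))    # up projection
w2 = rng.standard_normal((16, 8))       # down projection
y = expert_forward(x, w1_gate, w1_up, w2)
print(y.shape)  # (4, 8)
```

Running these stages as separate kernels leaves tensor cores idle during NVLink traffic; fusing them into one mega-kernel lets the communication of one chunk overlap the GEMMs of another.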
Bug fixes

- JIT could crash on distributed file systems
- Some kernel hangs and illegal memory accesses (IMA)
Notes

Mega MoE is still under active development and optimization; stay tuned, and optimization ideas are welcome! Disclaimer: this release relates only to DeepGEMM's development and has nothing to do with any internal model release.
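On the FP8 x FP4 path: one operand is quantized to 4-bit floating point. Assuming the common e2m1 FP4 encoding (1 sign, 2 exponent, 1 mantissa bit; the release notes do not spell out the exact variant or scaling scheme), its round-to-nearest behavior can be sketched in numpy — this is a hypothetical illustration, not DeepGEMM's quantization code:

```python
import numpy as np

# Representable magnitudes of e2m1 FP4 (1 sign, 2 exponent, 1 mantissa bit)
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_e2m1(x, scale):
    # Divide out the (hypothetical) per-block scale, round each magnitude
    # to the nearest representable e2m1 value, reapply sign and scale
    scaled = x / scale
    mag = np.abs(scaled)
    idx = np.abs(mag[..., None] - E2M1_VALUES).argmin(-1)
    return np.sign(scaled) * E2M1_VALUES[idx] * scale

q = quantize_fp4_e2m1(np.array([5.1, 0.7, -2.4]), 1.0)
print(q)  # [ 6.   0.5 -2. ]
```

Only 8 magnitudes exist per sign, which is why a per-block scale factor is essential to keep quantization error tolerable.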