https://github.com/deepseek-ai/DeepGEMM/pull/304
https://github.com/deepseek-ai/DeepGEMM/commit/a050d09461e86eb6bba35a8c74fc0e296e8e16c7#diff-59e30829961e1b429bc12115673562f6f15d2ed347cac8d27a879bf101e977cb
New features

- Mega MoE: fuses and overlaps dispatch / linear 1 / SwiGLU / linear 2 / combine into a single mega-kernel, overlapping NVLink communication with tensor core computation
  - Performance numbers will be posted later
  - Only FP8 x FP4 MoE is supported
  - Only EP <= 8 is tested
  - Requires PyTorch >= 2.9
- FP4 indexer (MQA logits) with larger MTP support
- FP8 x FP4 GEMM
- PDL (Programmatic Dependent Launch)
- Refactored GEMM heuristics
- Faster JIT compilation
- GEMM optimizations (swapped A/B, much faster MoE GEMM)
- DeepEP v2 MoE GEMM layout
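For readers unfamiliar with the pipeline Mega MoE fuses: between dispatch and combine, each expert runs linear 1, a SwiGLU activation, then linear 2. The sketch below shows only that reference math in plain numpy; it is not DeepGEMM's API, and all function and weight names here are illustrative (the real kernel operates on quantized FP8/FP4 tensors).

```python
import numpy as np

def swiglu(gate, up):
    # SwiGLU: SiLU(gate) * up, where SiLU(x) = x * sigmoid(x)
    return (gate / (1.0 + np.exp(-gate))) * up

def expert_forward(x, w1_gate, w1_up, w2):
    # linear 1 produces the gate and up projections, SwiGLU merges them,
    # linear 2 projects back down to the model dimension
    return swiglu(x @ w1_gate, x @ w1_up) @ w2

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))         # 4 routed tokens, hidden size 8
w1_gate = rng.standard_normal((8, 16))  # gate projection
w1_up = rng.standard_normal((8, 16))    # up projection
w2 = rng.standard_normal((16, 8))       # down projection
y = expert_forward(x, w1_gate, w1_up, w2)
print(y.shape)  # (4, 8)
```

Running these stages as separate kernels leaves tensor cores idle during NVLink traffic; fusing them into one mega-kernel lets the communication of one chunk overlap the GEMMs of another.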
Bug fixes

- JIT could crash on distributed file systems
- Some kernel hangs and illegal memory accesses (IMA)
Notes

Mega MoE is still under active development and optimization; stay tuned, and optimization ideas are welcome! Disclaimer: this release relates only to DeepGEMM's development and has nothing to do with any internal model release.
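On the FP8 x FP4 path: one operand is quantized to 4-bit floating point. Assuming the common e2m1 FP4 encoding (1 sign, 2 exponent, 1 mantissa bit; the release notes do not spell out the exact variant or scaling scheme), its round-to-nearest behavior can be sketched in numpy — this is a hypothetical illustration, not DeepGEMM's quantization code:

```python
import numpy as np

# Representable magnitudes of e2m1 FP4 (1 sign, 2 exponent, 1 mantissa bit)
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_e2m1(x, scale):
    # Divide out the (hypothetical) per-block scale, round each magnitude
    # to the nearest representable e2m1 value, reapply sign and scale
    scaled = x / scale
    mag = np.abs(scaled)
    idx = np.abs(mag[..., None] - E2M1_VALUES).argmin(-1)
    return np.sign(scaled) * E2M1_VALUES[idx] * scale

q = quantize_fp4_e2m1(np.array([5.1, 0.7, -2.4]), 1.0)
print(q)  # [ 6.   0.5 -2. ]
```

Only 8 magnitudes exist per sign, which is why a per-block scale factor is essential to keep quantization error tolerable.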