Memory Sparse Attention seems to be a novel approach to long context (up to 100M tokens)

Really interesting approach to mitigating long-context rot. Basically, a hyper-efficient index of the KV cache is kept in GPU VRAM, and it points to compressed KV cache blocks stored in system RAM. The approach requires introducing new layers and corresponding training so the model learns to retrieve from the KV cache properly and actually get the long-context benefits, so it isn't something you can immediately retrofit onto an existing model, but it seems worth the effort given the immense benefits it yields. They trained a 4B Qwen3 model with this architecture; because of the unique architecture, you need their custom inference engine to serve it (clone and compile from their GitHub).
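To make the idea concrete, here's a minimal toy sketch of the retrieval pattern described above: cheap per-block summaries live in a small "VRAM" index, while the full KV blocks sit compressed in "system RAM" and are only decompressed when a query matches. All names and details here are my own illustration, not the paper's actual mechanism (which involves trained retrieval layers, not a mean-vector heuristic).

```python
# Toy sketch (assumption, not the paper's method): a small index of block
# summaries stands in for the "VRAM index", and zlib-compressed arrays
# stand in for the "compressed KV cache in system RAM".
import pickle
import zlib

import numpy as np


class SparseKVStore:
    def __init__(self):
        self.index = {}  # block_id -> mean key vector ("stays in VRAM")
        self.store = {}  # block_id -> compressed K/V block ("system RAM")

    def add_block(self, block_id, keys, values):
        # Index entry: a cheap summary (mean of the block's key vectors).
        self.index[block_id] = keys.mean(axis=0)
        # Full block: serialized and compressed before being offloaded.
        self.store[block_id] = zlib.compress(pickle.dumps((keys, values)))

    def retrieve(self, query, top_k=2):
        # Score every block summary against the query, then fetch and
        # decompress only the top-k matching blocks.
        ranked = sorted(self.index, key=lambda b: -float(self.index[b] @ query))
        hits = []
        for block_id in ranked[:top_k]:
            keys, values = pickle.loads(zlib.decompress(self.store[block_id]))
            hits.append((block_id, keys, values))
        return hits
```

The point of the sketch is the memory split: only tiny summary vectors stay resident per block, so the on-GPU footprint grows far more slowly than the full context, while the heavy KV data is paged in on demand.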

https://arxiv.org/pdf/2603.23516

https://github.com/EverMind-AI/MSA

https://huggingface.co/EverMind-AI/MSA-4B

https://evermind.ai/blogs/breaking-the-100m-token-limit-msa-architecture-achieves-efficient-end-to-end-long-term-memory-for-llms

submitted by /u/ratbastid2000
