Memory Sparse Attention seems to be a novel approach to long context (reportedly up to 100M tokens)
Really interesting approach to solving long-context rot. Basically, a hyper-efficient index of the KV cache is stored in the GPU's VRAM, pointing to compressed KV cache entries stored in system RAM. It requires introducing new layers and correspondi…
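To make the two-tier idea concrete, here is a minimal toy sketch (not the actual Memory Sparse Attention implementation, which isn't detailed above). It assumes a small per-block summary lives in the fast "VRAM" index while full KV blocks sit compressed in "system RAM"; the class name, the mean-key summary, and the zlib compression are all illustrative choices, not from the source.

```python
import pickle
import zlib

import numpy as np

class TwoTierKVCache:
    """Toy two-tier KV cache: a small index (stand-in for VRAM)
    maps block ids to compressed KV blocks (stand-in for system RAM)."""

    def __init__(self):
        self.index = {}      # block_id -> cheap summary vector (fast memory)
        self.ram_store = {}  # block_id -> compressed (keys, values) blob (slow memory)

    def put(self, block_id, keys, values):
        # compress the full KV block and park it in the large, slow store
        self.ram_store[block_id] = zlib.compress(pickle.dumps((keys, values)))
        # the index keeps only a tiny summary (here: the mean key) for routing
        self.index[block_id] = keys.mean(axis=0)

    def topk_blocks(self, query, k=2):
        # score each block's summary against the query; only the winners
        # need to be decompressed and attended over
        scored = sorted(self.index, key=lambda b: -float(query @ self.index[b]))
        return scored[:k]

    def get(self, block_id):
        keys, values = pickle.loads(zlib.decompress(self.ram_store[block_id]))
        return keys, values
```

The point of the sketch is the access pattern: attention only ever touches the small index at full speed, and pays the decompress-and-transfer cost for the handful of blocks the index selects.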