Decoupled Attention from Weights – Gemma 4 26B

Absolutely unbelievably exciting work: split the attention state (i.e. a couple of GB) onto one local machine and the weights onto another local machine (say a cheap Xeon) to basically bypass the scale issue with local LLMs completely!! Repo with functional code: https://github.com/chrishayuk/larql
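To make the idea concrete, here is a minimal, purely illustrative sketch of the split (the class names, toy dimensions, and in-process "server" are my own assumptions, not the larql API): the heavy weight matmuls run on the machine with lots of RAM, while the small local machine keeps only the KV cache and the attention math, so just per-token activations would need to cross the wire.

```python
# Hypothetical sketch of decoupling attention from weights (NOT the larql API).
import numpy as np

D_MODEL = 64  # toy dimension so the example runs anywhere


class WeightServer:
    """Would run on the box with lots of RAM (e.g. a cheap Xeon):
    holds the large projection matrices and does the heavy matmuls."""

    def __init__(self, rng):
        self.wq = rng.standard_normal((D_MODEL, D_MODEL)) * 0.02
        self.wk = rng.standard_normal((D_MODEL, D_MODEL)) * 0.02
        self.wv = rng.standard_normal((D_MODEL, D_MODEL)) * 0.02

    def project(self, x):
        # In a real split, this call would go over the network (RPC/socket).
        return x @ self.wq, x @ self.wk, x @ self.wv


class AttentionClient:
    """Would run on the small local machine: keeps only the KV cache
    (a couple of GB for a real model) plus the softmax/attention math."""

    def __init__(self):
        self.k_cache, self.v_cache = [], []

    def step(self, q, k, v):
        self.k_cache.append(k)
        self.v_cache.append(v)
        ks = np.stack(self.k_cache)          # (t, d_model)
        vs = np.stack(self.v_cache)
        scores = (q @ ks.T) / np.sqrt(D_MODEL)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ vs                  # attended output for this token


rng = np.random.default_rng(0)
server = WeightServer(rng)
client = AttentionClient()
for _ in range(5):                           # decode 5 toy tokens
    x = rng.standard_normal(D_MODEL)
    q, k, v = server.project(x)              # heavy matmuls "remote"
    out = client.step(q, k, v)               # light attention "local"
print(out.shape)                             # (64,)
```

This is single-head and skips the MLP/output projections, but it shows the core point: the memory-hungry weights and the small, latency-sensitive attention state don't have to live on the same machine.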

edit: just found https://www.youtube.com/watch?v=1jGR4zqpyKA, an excellent overview of what's happening here.

submitted by /u/yeah-ok