Absolutely, unbelievably exciting work: split the attention computation (only a couple of GB of state) onto your local machine and put the weights onto another local machine (say, a cheap Xeon box), basically bypassing the memory-scale problem with local LLMs completely!! Repo with functional code: https://github.com/chrishayuk/larql
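To make the idea concrete, here's a minimal sketch of the general technique (this is NOT the larql code or its API, just an illustration): the machine holding the big weight matrices acts as a server, and the machine doing the attention/activation work ships activations over a socket and gets the matmul result back, so the weights never have to fit in local RAM.

```python
# Hypothetical sketch, not larql's actual protocol: weights live on a
# "weight host"; the local machine sends activations and receives results.
import json
import socket
import threading

def matvec(W, x):
    # Plain-Python matrix-vector product standing in for one layer's matmul.
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def serve_weights(W, sock):
    # Weight host: accept one request, compute locally, send the result back.
    conn, _ = sock.accept()
    with conn:
        x = json.loads(conn.makefile("r").readline())
        conn.sendall((json.dumps(matvec(W, x)) + "\n").encode())

def remote_matvec(addr, x):
    # Local machine: only activations cross the wire, never the weights.
    with socket.create_connection(addr) as conn:
        conn.sendall((json.dumps(x) + "\n").encode())
        return json.loads(conn.makefile("r").readline())

# Demo on localhost: the 2x2 "weight matrix" stays server-side.
W = [[1.0, 2.0], [3.0, 4.0]]
sock = socket.socket()
sock.bind(("127.0.0.1", 0))
sock.listen(1)
addr = sock.getsockname()
t = threading.Thread(target=serve_weights, args=(W, sock))
t.start()
result = remote_matvec(addr, [1.0, 1.0])
t.join()
print(result)  # → [3.0, 7.0]
```

In a real setup the bottleneck becomes network latency per layer, which is why only the small attention state lives locally while the bulky weights sit on the remote box.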
edit: just found https://www.youtube.com/watch?v=1jGR4zqpyKA, an excellent overview of what's happening here.