I got interested in seeing whether I could "compile" a program into transformer weights instead of learning them by training. I've been working on it for a couple of months now but finally decided to just stop and write it up, so this is a bit of a long post, but maybe some of you will find it interesting.

Basically, I define the residual stream as a set of "registers" and generate the attention weights and MLP functions that execute an RPN interpreter (e.g. evaluating an expression like "3 4 + 2 *"). For now I settled on distilling the non-linear logic into the MLPs by training, but the attention weights are fully calculated by the compiler. I think it should be possible to calculate the MLP weights eventually too, but that probably needs more of an AST behind it.

In a way it's a sort of useless exercise (who really needs an RPN interpreter that clocks in at 1.1 GB?), but see the last bit for some thoughts on how this might have some application. I did learn to think about transformers and attention a bit differently after working on this, so I hope it's interesting to some people out there.
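To give a flavor of what "the compiler writes the attention weights" means, here's a stripped-down sketch of a single hand-built head that copies one register slot of the residual stream into another. The layout (a 16-dim stream split into 4-wide slots, and the names SLOTS, stack0, acc, etc.) is a toy illustration I made up for this post, not my actual register assignment:

    import numpy as np

    D = 16  # residual stream width (toy value)
    R = 4   # width of one register slot
    SLOTS = {"stack0": slice(0, 4), "stack1": slice(4, 8),
             "acc": slice(8, 12), "pos": slice(12, 16)}

    def softmax(s):
        e = np.exp(s - s.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def copy_head(src, dst):
        """Hand-written (not trained) weights for one attention head
        that copies register `src` into register `dst`."""
        W_Q = np.eye(D)  # trivial Q/K: fine for this one-token demo;
        W_K = np.eye(D)  # a real compiled head would key on `pos`
        W_V = np.zeros((D, D))
        W_V[SLOTS[src], SLOTS[src]] = np.eye(R)  # read the src slot
        W_O = np.zeros((D, D))
        W_O[SLOTS[src], SLOTS[dst]] = np.eye(R)  # write it into dst
        return W_Q, W_K, W_V, W_O

    def attend(x, W_Q, W_K, W_V, W_O):
        q, k, v = x @ W_Q, x @ W_K, x @ W_V
        a = softmax(q @ k.T / np.sqrt(D))
        return a @ v @ W_O  # the head's write into the residual stream

    # One token whose stack0 register holds a value; run the copy head.
    x = np.zeros((1, D))
    x[0, SLOTS["stack0"]] = [3.0, 1.0, 4.0, 1.0]
    x = x + attend(x, *copy_head("stack0", "acc"))  # residual connection
    print(x[0, SLOTS["acc"]])  # -> [3. 1. 4. 1.]: stack0 copied into acc

Scaled up, the compiler's job is mostly emitting stacks of heads like this for the moves and stack shuffles; the non-linear steps are what end up in the trained MLPs.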