Hey there, people. Let's talk about Gemma 4 per-layer embeddings. How far can they go? Is there clear-cut knowledge stored inside those embeddings, with the model parameters handling only the logic? Or is it like every other LLM phenomenon, where no single component can be said to be responsible for one aspect of the overall performance?

If it really is a clear-cut store of knowledge that the model uses as a lookup table, how far could that go, and can more knowledge be added? Could the embeddings be scaled up so that 20 billion of those parameters are just for the embeddings, while the model itself stays at the same 2 billion? Sorry if this question is stupid, but I am very, very interested in small models because of my lack of a GPU (I don't have one at all). Thanks.
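For anyone picturing what "knowledge in the embeddings vs. logic in the layers" might mean mechanically, here is a rough sketch of the general per-layer embedding idea as it is usually described: each layer gets its own small, token-indexed table whose entries are added into that layer's hidden state. This is not Gemma's actual code; all names, dimensions, and the projection step are my own assumptions for illustration.

```python
import torch
import torch.nn as nn

class PerLayerEmbeddingSketch(nn.Module):
    """Illustrative only: a toy transformer where every layer has its own
    per-token embedding table, on top of the usual shared input embedding."""

    def __init__(self, vocab_size=32000, d_model=2048, d_ple=256, n_layers=24):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)  # shared input embedding
        # One small lookup table per layer (the "per-layer embeddings").
        self.ple = nn.ModuleList(
            nn.Embedding(vocab_size, d_ple) for _ in range(n_layers)
        )
        # Project each per-layer lookup up to the model width before adding it.
        self.ple_proj = nn.ModuleList(
            nn.Linear(d_ple, d_model, bias=False) for _ in range(n_layers)
        )
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(n_layers)
        )

    def forward(self, token_ids):  # token_ids: (batch, seq)
        h = self.tok_embed(token_ids)
        for layer, ple, proj in zip(self.layers, self.ple, self.ple_proj):
            # Inject a per-token, per-layer signal straight from a lookup table.
            h = h + proj(ple(token_ids))
            h = layer(h)
        return h
```

The relevant point for the size question: tables like these scale with vocab_size × d_ple × n_layers, so in principle they can hold a large share of the total parameter count (and be kept in cheap memory, looked up one token at a time) while the transformer layers themselves stay small. Whether that cleanly separates "knowledge" from "logic" is exactly the open question being asked above.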