Speculative Decoding

I've spent the past 30 minutes looking into what speculative decoding is and how it works. I realize that's not much time to try to understand something, and I hope you'll forgive me, but I've hit a mental block on one question that I feel I have to resolve first.

Here's my confusion:

There appears to be a claim that output quality remains just as good as if you only used the target (big) model, but this doesn't sit right with me. If we let the smaller model quickly generate 1-4 candidate tokens, we are relying on that model's self-attention and feed-forward network (FFN) to produce them, are we not? So even if we present those tokens as input to the larger target model, we are not utilizing the target model's training on self-attention or the FFN. It seems to me that we are only relying on its decoder layer, which would bypass a lot of the quality of the inference, wouldn't it?
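For reference, here is my rough mental model of the draft-then-verify loop as a toy Python sketch. All the names (`draft_model`, `target_model`, `speculative_step`) are made up, the "models" are stand-in functions over integer token IDs, and I'm using exact-match greedy verification rather than the probabilistic accept/reject rule real systems use, so please correct me if this is wrong:

```python
# Toy sketch of the speculative decoding loop as I understand it.
# The "models" are stand-in functions mapping a context (tuple of token
# ids) to a next-token id. Real systems compare full probability
# distributions with a stochastic accept/reject rule; this greedy
# exact-match version is just to show the control flow.

def target_model(context):
    # "big" model: always continues the integer sequence correctly
    return context[-1] + 1

def draft_model(context):
    # "small" model: same idea, but deliberately wrong on some tokens
    nxt = context[-1] + 1
    return nxt if nxt % 4 else nxt + 10

def speculative_step(context, k=4):
    # 1) the draft model proposes k tokens autoregressively (cheap)
    drafted, ctx = [], list(context)
    for _ in range(k):
        t = draft_model(tuple(ctx))
        drafted.append(t)
        ctx.append(t)
    # 2) the target model scores all k positions in one parallel pass
    #    (simulated with a loop here) and keeps the longest prefix it
    #    agrees with; the first mismatch is replaced by its own token
    accepted, ctx = [], list(context)
    for t in drafted:
        expect = target_model(tuple(ctx))
        if t != expect:
            accepted.append(expect)
            break
        accepted.append(t)
        ctx.append(t)
    return accepted

print(speculative_step((0, 1, 2), k=4))  # -> [3, 4]
```

If this is right, then in the toy run the draft proposes 3, 14, 15, 26; the target accepts 3, rejects 14, and substitutes its own 4, so the emitted prefix [3, 4] is exactly what the target alone would have produced. Which makes my question: is this verification step really enough to preserve quality?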

I realize that for words like if/the/and/of/etc., the tokens carry so little information that, by and large, either model would likely end up producing the same result. But what if the sequence being generated is highly specific and information-dense, or is outside the parameter space of the smaller model? Wouldn't we lose the opportunity to use the larger model's intelligence, and be none the wiser that it even happened? Or is the larger model's decoder just that good?

And an adjacent question, if you guys don't mind: how can the token embeddings produced by the fast model (which, if I understood correctly, have not yet passed through decoding) be used in the target model's decoder? Wouldn't they be in completely different embedding spaces? The explanation I saw glossed over this. Do they have to be transformed into the target model's embedding space somehow?

Maybe I am not understanding how it works correctly. I would appreciate some of the smart people here helping me grasp the concept better. Thanks!

Edit: Also, I realize I can just ask an LLM, but for once I thought it would be good to ask a public question because the answers may be helpful to others. That used to be a thing lol.

submitted by /u/hesperaux