FPGAs for speculative decoding

A couple of questions for anyone who knows FPGAs:

- What's the max model size one can be designed for? (I've read 20-30M parameters max; is it possible to go higher if quantized, at a reasonable price?)
- Taalas: is what they're doing with ASICs more viable? (Rumored: Qwen 27B @ 10k tok/sec, apparently on <$800 of hardware.)
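
For a sense of scale on the sizing question, here is a rough back-of-the-envelope sketch of weight storage alone at different quantization widths. The `weight_bytes` helper is hypothetical, the 30M and 27B figures are the ones mentioned above, and it ignores activations, KV cache, and any packing overhead, so treat the output as a lower bound rather than a real resource estimate.

```python
def weight_bytes(n_params: int, bits_per_weight: int) -> float:
    """Bytes needed to store the weights only (no activations, no KV cache)."""
    return n_params * bits_per_weight / 8


if __name__ == "__main__":
    # Parameter counts taken from the post; bit widths are illustrative.
    models = (("30M draft", 30_000_000), ("27B target", 27_000_000_000))
    for name, n_params in models:
        for bits in (16, 8, 4):
            megabytes = weight_bytes(n_params, bits) / 1e6
            print(f"{name} @ {bits}-bit: {megabytes:,.1f} MB")
```

Comparing these numbers against the on-chip memory listed in a given device's datasheet shows whether the weights could stay on-chip or would have to stream from external memory.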

Would speculative decoding work here? Are there other strategies that would be better, if the smaller model generates tokens 100x faster than the larger one?
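
To make the question concrete, here is a minimal sketch of the usual expected-speedup estimate for speculative decoding. It assumes each draft token is accepted independently with probability `alpha`, and that verifying `gamma` draft tokens costs one pass of the large model. The 0.01 cost ratio comes from the 100x figure above; the `alpha` values are made up for illustration, since the real acceptance rate depends on how well the small model matches the large one.

```python
def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimated speedup of speculative decoding over plain decoding.

    alpha      -- probability the target model accepts a draft token (assumed i.i.d.)
    gamma      -- number of draft tokens proposed per verification pass
    cost_ratio -- time of one draft step divided by time of one target step
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    if gamma < 1:
        raise ValueError("gamma must be at least 1")
    # Expected tokens produced per verification pass (accepted drafts plus one).
    expected_tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    # Cost of one round, in units of a single target-model step.
    round_cost = gamma * cost_ratio + 1
    return expected_tokens / round_cost


if __name__ == "__main__":
    cost_ratio = 0.01  # draft model 100x faster, per the post
    for alpha in (0.6, 0.8, 0.9):  # hypothetical acceptance rates
        best_gamma = max(
            range(1, 65), key=lambda g: expected_speedup(alpha, g, cost_ratio)
        )
        speedup = expected_speedup(alpha, best_gamma, cost_ratio)
        print(f"alpha={alpha}: best gamma={best_gamma}, speedup~{speedup:.2f}x")
```

Under these assumptions the speedup is capped at 1 / (1 - alpha) however fast the draft model is, so a 100x faster drafter mostly helps by making long draft runs cheap; the acceptance rate is what limits the gain.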

Thanks!

submitted by /u/dp3471
