DFlash Promises up to a 6x Speedup for LLMs: Does It Deliver?

I benchmarked three implementations and learned something useful about why long-context speculative decoding is actually slower…