FlashAttention: Fast Transformer training with long sequences

By Adept — Blog / January 17, 2023

Transformers have grown deeper and wider, but training them on long sequences remains difficult. The attention layer at their heart is the compute and memory bottleneck: doubling the sequence length quadruples its runtime and memory requirements.
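To make the quadratic scaling concrete, here is a minimal sketch that counts the entries of the attention score matrix a standard (non-fused) attention layer materializes. It assumes one attention head, one sequence, and 2 bytes per entry (fp16). The function names are ours, chosen for illustration only.

```python
def attention_matrix_entries(seq_len: int) -> int:
    """Entries in the seq_len x seq_len attention score matrix
    (assumption: a single head and a single sequence)."""
    return seq_len * seq_len


def attention_matrix_bytes(seq_len: int, bytes_per_entry: int = 2) -> int:
    """Memory needed to store that matrix.
    Assumption: 2 bytes per entry, i.e. fp16."""
    return attention_matrix_entries(seq_len) * bytes_per_entry


# Each doubling of the sequence length multiplies the matrix size by four.
for n in (1024, 2048, 4096):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"seq_len={n}: {attention_matrix_entries(n)} entries, {mib:.0f} MiB")
```

Real models multiply these numbers by the batch size and the number of heads, so the score matrix quickly dominates memory at long sequence lengths.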
