FlashAttention: Fast Transformer training with long sequences

Transformers have grown deeper and wider, but training them on long sequences remains difficult. The attention layer at their heart is the compute and memory bottleneck: its cost scales quadratically with sequence length, so doubling the sequence length quadruples both runtime and memory requirements.
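To make the quadratic cost concrete, here is a minimal sketch of standard attention for a single head. This is a naive PyTorch illustration, not FlashAttention's implementation; the function name and tensor shapes are chosen for this example.

```python
import torch

def naive_attention(q, k, v):
    # q, k, v: (seq_len, head_dim) for a single attention head.
    scale = q.shape[-1] ** -0.5
    # Materializes the full (seq_len, seq_len) score matrix:
    # this is the quadratic memory bottleneck. At seq_len = 8192,
    # one fp32 score matrix is 8192 * 8192 * 4 bytes ~= 256 MiB per head.
    scores = (q @ k.T) * scale
    weights = torch.softmax(scores, dim=-1)
    return weights @ v  # (seq_len, head_dim)
```

The `(seq_len, seq_len)` score matrix is the culprit: both the matrix multiply that produces it and the memory needed to hold it grow with the square of the sequence length, which is the cost FlashAttention is designed to avoid paying in full.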
