Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading
arXiv:2410.21316v2 Announce Type: replace
Abstract: Transformers and large language models (LLMs) have seen rapid adoption across all domains. Their sizes have exploded to hundreds of billions of parameters and keep increasing. Under these circumstances, …