On Harnessing Idle Compute at the Edge for Foundation Model Training
arXiv:2512.22142v2 Announce Type: replace-cross
Abstract: The foundation-model ecosystem remains highly centralized because training requires immense compute resources and is therefore largely confined to large cloud operators. Edge-assisted foundation model training, which harnesses spare compute on edge devices, offers a more democratized alternative. However, existing edge-training approaches fall short: they struggle to match cloud-training performance, scale to larger models, fit within device memory limits, or keep communication overhead manageable, and they handle device heterogeneity and churn poorly.
We introduce Cleave, built on a structural insight: each GEMM has an asymmetric I/O pattern -- its input matrices, sent over the downlink, are much larger than the partial output blocks returned over the uplink -- which matches edge networks, where downlink bandwidth exceeds uplink bandwidth by 2--10x. Exploiting this alignment through a parameter-server-centric architecture, Cleave makes per-device communication \emph{decrease} as more devices join, rather than stay constant as in conventional tensor parallelism (TP). Decomposing training into independent sub-GEMM tasks also yields a single scheduling abstraction that unifies memory constraints, communication overhead, and fault tolerance under device churn.
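As a back-of-the-envelope illustration of this insight, consider a single GEMM C = A x B whose output is tiled across devices. The Python sketch below is only a sketch under assumed parameters: the p x q output tiling, the function name per_device_traffic, and the fp16 matrix shapes are illustrative choices, not Cleave's actual scheduler. It demonstrates the two properties the abstract claims: downlink volume dominates uplink volume, and per-device traffic shrinks as the device count p * q grows.

    def per_device_traffic(m, k, n, p, q, bytes_per_elem=2):
        """Per-device traffic in bytes for C = A @ B (A is m x k, B is k x n),
        with C split into a p x q grid of output tiles, one tile per device.

        Each device downloads a row panel of A (m/p x k) and a column panel
        of B (k x n/q), then uploads only its small output tile (m/p x n/q).
        """
        downlink = ((m // p) * k + k * (n // q)) * bytes_per_elem
        uplink = (m // p) * (n // q) * bytes_per_elem
        return downlink, uplink

    if __name__ == "__main__":
        m, k, n = 4096, 4096, 16384  # a transformer-scale GEMM, fp16 elements
        for p, q in [(2, 2), (4, 4), (8, 8), (16, 16)]:
            down, up = per_device_traffic(m, k, n, p, q)
            print(f"{p * q:4d} devices: down {down / 2**20:6.1f} MiB, "
                  f"up {up / 2**20:5.1f} MiB, down/up {down / up:4.1f}x")

Under these assumed shapes, going from 4 to 256 devices cuts per-device downlink from 80 MiB to 10 MiB and uplink from 32 MiB to 0.5 MiB, while the down/up ratio widens from 2.5x to 20x -- consistent with the downlink-favoring asymmetry of edge networks.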
Our evaluation shows that Cleave achieves cloud-comparable GPU training performance and outperforms state-of-the-art edge-training methods by 4--10x in per-batch runtime at the same device counts. Beyond this shared operating range, Cleave scales to thousands of heterogeneous devices -- a regime where prior edge-training systems cannot operate -- and recovers from device failures at least 100x faster.