this community obviously cares about running models locally. but i've been wondering if the bigger problem is training, not inference
local inference is cool but the models still get trained in datacenters by big labs. is there a path where training also gets distributed or is that fundamentally too hard?
not talking about any specific project, just the concept. what would it take for distributed training to actually work at meaningful scale? feels like the coordination problems would be brutal
[link] [comments]