been thinking about this problem for a while and most solutions I’ve found either solve part of it or introduce their own complexity
the specific problem: we want to be able to run GPU workloads across multiple providers for availability and cost reasons, but maintaining provider-specific deployment configuration for each workload doesn't scale. when a provider has availability issues or changes its pricing, we want to shift workloads without it turning into an engineering project
approaches I’ve looked at:
K8s with multi-provider node pools: works, but the provider-specific configuration lives in the scheduling logic and GPU driver setup. "portable" means "requires significant modification per provider" in practice (a rough sketch of what that leakage looks like is after this list). also, recovery from GPU-specific failures requires a lot of custom logic that ends up being provider-specific anyway
Terraform: solves infrastructure provisioning portability. doesn't solve workload scheduling portability. you can terraform your way to nodes on multiple providers, but you still need to tell each workload where to run
custom abstraction layer: we built one. it worked until a provider changed their API. maintenance overhead compounded
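to make the K8s point concrete, here's roughly what the per-provider leakage looks like. everything here is invented for illustration (the provider names, label keys, and taints aren't from any specific cloud), but the shape of the problem is the same: the job doesn't change, yet every provider needs its own hand-maintained scheduling glue:

```python
# hypothetical sketch: the same training job rendered per provider.
# the quirks dict (label keys, taints) is the part that has to be
# hand-maintained for every provider and every workload.
WORKLOAD = {"image": "ghcr.io/example/train:latest", "gpus": 4}

PROVIDER_QUIRKS = {
    "provider_a": {
        "node_selector": {"gpu.provider-a.example/class": "a100"},
        "tolerations": [{"key": "provider-a.example/gpu", "operator": "Exists"}],
    },
    "provider_b": {
        "node_selector": {"provider-b.example/accelerator": "a100-80gb"},
        "tolerations": [{"key": "provider-b.example/dedicated", "value": "gpu", "operator": "Equal"}],
    },
}

def render_pod_spec(provider: str) -> dict:
    """Build the pod spec for one provider; nothing about the job itself changed."""
    quirks = PROVIDER_QUIRKS[provider]
    return {
        "nodeSelector": quirks["node_selector"],
        "tolerations": quirks["tolerations"],
        "containers": [{
            "name": "train",
            "image": WORKLOAD["image"],
            "resources": {"limits": {"nvidia.com/gpu": WORKLOAD["gpus"]}},
        }],
    }
```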
the pattern that seems to actually work from what I’ve read: separating the workload definition from the infrastructure binding entirely, with a scheduling layer that handles the matching. define what the workload needs, not where it runs. let the scheduler figure out placement across available hardware
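for what it's worth, the matching piece doesn't have to be exotic. here's a minimal sketch of the idea, assuming providers can be polled for capacity and price — all the names (Workload, GpuOffer, choose_placement) are made up for illustration, not a real system:

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class GpuOffer:
    """What a provider currently has available (assumed to be pollable)."""
    provider: str
    region: str
    gpu_type: str
    gpus_free: int
    price_per_gpu_hour: float

@dataclass
class Workload:
    """What the job needs -- no provider or region anywhere in here."""
    name: str
    gpu_type: str
    gpus: int
    max_price_per_gpu_hour: float

def choose_placement(wl: Workload, offers: list[GpuOffer]) -> GpuOffer | None:
    """Pick the cheapest offer that satisfies the workload's requirements."""
    candidates = [
        o for o in offers
        if o.gpu_type == wl.gpu_type
        and o.gpus_free >= wl.gpus
        and o.price_per_gpu_hour <= wl.max_price_per_gpu_hour
    ]
    return min(candidates, key=lambda o: o.price_per_gpu_hour, default=None)

# if a provider disappears or reprices, the workload definition doesn't change;
# only the offer list does, and placement shifts on the next scheduling pass.
offers = [
    GpuOffer("provider_a", "us-east", "a100-80gb", 8, 2.10),
    GpuOffer("provider_b", "eu-west", "a100-80gb", 16, 1.85),
]
job = Workload("train-llm", "a100-80gb", 8, 2.50)
print(choose_placement(job, offers))
```

the hard parts are obviously everything this sketch ignores (data locality, quota, preemption, driver setup per provider), which is why I'm asking about the operational side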
curious if anyone has implemented this in practice and what that looks like operationally