[D] How do you guys handle GPU waste on K8s?
I was tasked with managing PyTorch training infra on GKE. Cost keeps climbing, but GPU util sits around 30-40% according to Grafana. I am pretty sure half our jobs request 4 or more GPUs and then starve them waiting on data. Right now I’m basically playing …