Cold start latency on GPU cloud platforms in 2026 — p99 specifically, not p50. Anyone have real data? [D]

doing infrastructure evaluation for inference workloads and running into the same problem everywhere: every platform publishes p50 (i.e. median) cold start claims. nobody publishes p99. and p99 is the number that shows up in support tickets and SLA violations, not p50.

what I’m specifically trying to understand:

how does cold start p99 behave under load vs normal conditions — is there meaningful degradation when providers are at high utilization?

does multi-provider pooling actually improve p99 or just p50? the logic seems sound (route to where capacity exists) but I haven’t found published data. I tried to sanity-check the intuition with a toy simulation, second sketch below

how much of cold start is infrastructure queue time vs model loading time? I suspect these are often conflated in marketing claims
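
fwiw, the only reliable way I’ve found to split queue time from load time is to instrument the container entrypoint yourself. minimal sketch below, assuming you control the image and the client stamps each request with its own submit time; `handler`, `_load_model`, and `client_submit_ts` are made-up names, not any provider’s API:

```python
import time

# Captured at import time, i.e. roughly when the provider actually started
# the container. Everything between the client's submit timestamp and this
# point is infrastructure queue time.
CONTAINER_START = time.time()

_MODEL = None
_MODEL_READY = None

def _load_model():
    time.sleep(0.1)  # placeholder for pulling/loading 70B weights onto the GPU
    return object()

def handler(request):
    # request["client_submit_ts"] is an assumption: the client stamps the
    # request with its own wall clock. NTP-level skew is negligible at the
    # seconds-to-minutes scale cold starts live at.
    global _MODEL, _MODEL_READY
    if _MODEL is None:
        _MODEL = _load_model()
        _MODEL_READY = time.time()
    return {
        "queue_s": CONTAINER_START - request["client_submit_ts"],
        "model_load_s": _MODEL_READY - CONTAINER_START,
        "cold_total_s": _MODEL_READY - request["client_submit_ts"],
    }
```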
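
on the pooling question above: a toy Monte Carlo makes the dependence structure clear. every distribution parameter here is invented, the only point is the shape of the effect: idealized routing (take the min over k providers) collapses p99 when tail events are independent across providers, and barely helps when congestion hits all providers at once:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200_000, 3
SPIKE_P, SPIKE_SCALE = 0.02, 120.0   # 2% of requests hit a long queue delay

def provider_latency(spike_mask):
    # lognormal body (~20s median) plus a queueing spike where masked
    body = rng.lognormal(mean=3.0, sigma=0.4, size=n)
    return body + spike_mask * rng.exponential(SPIKE_SCALE, size=n)

for label, shared in [("independent tails", False), ("correlated congestion", True)]:
    platform_wide = rng.random(n) < SPIKE_P   # one shared congestion event stream
    draws = np.stack([
        provider_latency(platform_wide if shared else (rng.random(n) < SPIKE_P))
        for _ in range(k)
    ])
    single, pooled = draws[0], draws.min(axis=0)   # ideal routing = min of k
    for name, x in (("single", single), ("pooled", pooled)):
        print(f"{label:21s} {name:6s} "
              f"p50={np.percentile(x, 50):5.1f}s p99={np.percentile(x, 99):6.1f}s")
```

which is why I’d want any pooling claim to come with data on cross-provider tail correlation, not just a p50 comparison.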

for context: running inference workloads on 70B-class models, RTX 5090 and H200 primarily. p99 matters here because the latency is user-facing.

anyone have real numbers or methodology for measuring this properly?
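
to make any answers comparable, here’s the kind of harness I’d run. `force_scale_to_zero` and `fire_request` are hypothetical hooks you implement per provider; the rest is just wall-clock timing and percentiles, with the caveat that a p99 estimate from a few hundred samples rides on the ~3 worst observations, so repeat across days and utilization regimes:

```python
import time
import numpy as np

def measure_cold_starts(force_scale_to_zero, fire_request, n_trials=300,
                        settle_s=0.0):
    """Collect end-to-end cold start samples.

    force_scale_to_zero() and fire_request() are provider-specific hooks
    (e.g. delete the replica / hit the endpoint and block until the first
    token). Returns raw samples so you can slice by time of day later.
    """
    samples = []
    for _ in range(n_trials):
        force_scale_to_zero()
        time.sleep(settle_s)               # make sure the instance is really gone
        t0 = time.monotonic()
        fire_request()
        samples.append(time.monotonic() - t0)
    return np.array(samples)

def report(samples):
    for q in (50, 95, 99):
        print(f"p{q}: {np.percentile(samples, q):7.2f}s")
    print(f"max: {samples.max():7.2f}s  (n={len(samples)})")

if __name__ == "__main__":
    # fake provider so the sketch runs standalone
    rng = np.random.default_rng(1)
    fake = lambda: time.sleep(rng.lognormal(-2, 0.5))
    report(measure_cold_starts(lambda: None, fake, n_trials=50))
```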
