What should a PyTorch training end-of-run performance summary show? [D]
For most slow PyTorch runs, the first question isn't "show me every trace event". It is simply: where do I even start?

- Where did step time go?
- Was the run input-bound, compute-bound, or wait-heavy?
- Were ranks imbalanced?
- Was memory stable …
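To make the questions above concrete, here is a rough sketch of how per-step measurements could be reduced to a few end-of-run answers. It is plain Python and not tied to any PyTorch API. Everything in it is an assumption for illustration: the `StepRecord` fields, the `summarize` helper, and the thresholds (50% share to call a phase dominant, 10% spread across ranks, 5% memory growth). How you actually collect the timings is up to you.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class StepRecord:
    """One training step on one rank. All fields are hypothetical."""
    rank: int
    data_s: float     # seconds waiting on the input pipeline
    compute_s: float  # seconds in forward, backward, and optimizer
    wait_s: float     # seconds blocked on other ranks
    mem_mb: float     # peak memory observed during the step


def summarize(records, dominant=0.5, imbalance_tol=0.10, mem_growth_tol=0.05):
    """Reduce chronologically ordered step records to a short summary dict."""
    if not records:
        raise ValueError("no step records to summarize")

    totals = {
        "input": sum(r.data_s for r in records),
        "compute": sum(r.compute_s for r in records),
        "wait": sum(r.wait_s for r in records),
    }
    total = sum(totals.values())
    if total <= 0:
        raise ValueError("recorded step time is zero")
    shares = {name: value / total for name, value in totals.items()}

    # Label the run by its largest phase, but only if that phase dominates.
    labels = {"input": "input-bound", "compute": "compute-bound", "wait": "wait-heavy"}
    top = max(shares, key=shares.get)
    verdict = labels[top] if shares[top] >= dominant else "mixed"

    # Rank imbalance: spread of mean step time across ranks, relative to the average.
    per_rank = {}
    for r in records:
        per_rank.setdefault(r.rank, []).append(r.data_s + r.compute_s + r.wait_s)
    rank_means = [mean(times) for times in per_rank.values()]
    imbalance = (max(rank_means) - min(rank_means)) / mean(rank_means)

    # Memory stability: compare mean peak memory in the first and second half.
    mems = [r.mem_mb for r in records]
    half = len(mems) // 2
    if half == 0:
        growth = 0.0
    else:
        base = mean(mems[:half]) or 1.0
        growth = (mean(mems[half:]) - base) / base

    return {
        "shares": shares,
        "verdict": verdict,
        "imbalance": imbalance,
        "ranks_imbalanced": imbalance > imbalance_tol,
        "mem_growth": growth,
        "memory_stable": abs(growth) <= mem_growth_tol,
    }


# Made-up numbers: two ranks, four steps each, rank 1 is slower in compute.
records = []
for step in range(4):
    records.append(StepRecord(rank=0, data_s=0.6, compute_s=0.3, wait_s=0.1, mem_mb=1000.0))
    records.append(StepRecord(rank=1, data_s=0.6, compute_s=0.5, wait_s=0.1, mem_mb=1000.0))

summary = summarize(records)
for key, value in summary.items():
    print(f"{key}: {value}")
```

The point of the sketch is the shape of the output: one verdict, one imbalance number, one memory number, each with the raw share behind it so the reader can decide where to dig next. The thresholds are arbitrary and would need tuning for a real workload.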