What should a PyTorch training end-of-run performance summary show? [D]

For most slow PyTorch runs, the first question isn't "show me every trace event." It's "where do I even start?":

- where did step time go?
- was the run input-bound, compute-bound, or wait-heavy?
- were ranks imbalanced?
- was memory stable or creeping up?
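
As a rough sketch of how those four questions could be answered from cheap per-step counters: the function below takes per-step phase timings and returns a small summary dict. Everything here is hypothetical and only for illustration: the name `summarize_run`, the three phase keys (`data`, `compute`, `wait`), the "largest share wins" bottleneck rule, and the first-quarter vs last-quarter memory comparison are assumptions, not part of PyTorch or any existing tool. In a real job the GPU timings would need `torch.cuda.synchronize()` or CUDA events to be accurate, and the memory samples could come from something like `torch.cuda.max_memory_allocated()`.

```python
from statistics import mean, median

# Hypothetical mapping from the dominant phase to a human-readable label.
LABELS = {"data": "input-bound", "compute": "compute-bound", "wait": "wait-heavy"}


def summarize_run(steps, rank_step_times=None, mem_samples=None):
    """Build a compact end-of-run summary (illustrative sketch only).

    steps: list of dicts with seconds spent per phase in each step,
           keys 'data', 'compute', 'wait' (assumed phase breakdown).
    rank_step_times: optional {rank: [step_seconds, ...]}.
    mem_samples: optional list of memory readings over the run (any unit).
    """
    if not steps:
        raise ValueError("need at least one step")

    # Where did step time go? Total seconds per phase, then share of the total.
    totals = {k: sum(s[k] for s in steps) for k in LABELS}
    total = sum(totals.values())
    if total <= 0:
        raise ValueError("total measured time must be positive")
    shares = {k: v / total for k, v in totals.items()}

    # Input-bound, compute-bound, or wait-heavy? Assumed rule: largest share wins.
    dominant = max(shares, key=shares.get)
    summary = {
        "mean_step_s": total / len(steps),
        "shares": shares,
        "bottleneck": LABELS[dominant],
    }

    # Were ranks imbalanced? Slowest rank's mean step time relative to the median rank.
    if rank_step_times:
        means = {r: mean(t) for r, t in rank_step_times.items()}
        slowest = max(means, key=means.get)
        summary["slowest_rank"] = slowest
        summary["rank_imbalance"] = means[slowest] / median(means.values())

    # Was memory stable or creeping up? Relative growth, last quarter vs first quarter.
    if mem_samples and len(mem_samples) >= 2:
        q = max(1, len(mem_samples) // 4)
        summary["mem_growth"] = mean(mem_samples[-q:]) / mean(mem_samples[:q]) - 1

    return summary
```

Calling `summarize_run(steps, rank_step_times, mem_samples)` once at the end of training and printing the dict would be the whole integration. The thresholds for what counts as "imbalanced" or "creeping" are deliberately left out, since those are judgment calls per workload.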

I have been thinking about what a compact end-of-run summary would look like: lightweight enough to run on every job, not just dedicated profiling runs.

Here's one example of what that output could look like:

https://preview.redd.it/2q71s9ltkvzg1.png?width=533&format=png&auto=webp&s=cde99ed3224d723bb6dba200b326da826ba4f587

Curious how others are solving this today. What would make something like this useful? What is missing?

submitted by /u/traceml-ai