cs.CV, cs.LG

LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute

arXiv:2605.06809v1 Announce Type: new
Abstract: Transformers dominate video recognition. They split videos into tokens, and processing those tokens incurs an expensive, superlinear computational cost. Yet videos are filled with redundancy, so we can question the nee…