Smoothing Slot Attention Iterations and Recurrences
arXiv:2508.05417v3 Announce Type: replace
Abstract: Slot Attention (SA) lies at the heart of mainstream Object-Centric Learning (OCL). SA aggregates image features into object-level representations by iteratively refining cold-start query slots. For video, SA is shared recurrently across frames: queries are cold-started on the first frame and transitioned from the previous frame's slots thereafter. However, cold-start queries lack sample-specific cues, which hinders precise aggregation on an image or a video's first frame; queries on non-first frames are already sample-specific and thus require aggregation transforms different from those of the first frame. We address these issues with SmoothSA: (1) to smooth SA iterations on an image or a video's first frame, we preheat the cold-start queries with rich input-feature information, using a tiny module self-distilled inside OCL; (2) to smooth SA recurrences across a video's first and non-first frames, we differentiate the otherwise homogeneous aggregation transforms by using full iterations on the first frame and a single iteration on non-first frames. Comprehensive experiments on object discovery, recognition and visual reasoning validate our method's effectiveness. Further visual analyses illuminate the underlying mechanisms. Our source code, model checkpoints and training logs are available at https://github.com/Genera1Z/SmoothSA.
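The iteration schedule in point (2) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: `sa_step`, `aggregate` and `run_video` are hypothetical names, the SA step is a simplified version without learned projections, GRU or MLP, the transition between frames is assumed to be the identity, and the preheating module of point (1) is omitted because the abstract does not specify it.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sa_step(slots, feats, eps=1e-8):
    """One simplified SA iteration (assumption: no learned projections, GRU or MLP)."""
    d = slots.shape[-1]
    logits = slots @ feats.T / np.sqrt(d)          # (num_slots, num_feats)
    attn = softmax(logits, axis=0)                 # slots compete for each feature
    weights = attn / (attn.sum(axis=1, keepdims=True) + eps)  # normalize per slot
    return weights @ feats                         # per-slot weighted mean of features

def aggregate(slots, feats, num_iters):
    # Refine the query slots against one frame's features num_iters times.
    for _ in range(num_iters):
        slots = sa_step(slots, feats)
    return slots

def run_video(frames, init_slots, full_iters=3):
    """Full iterations on the first frame, a single iteration on later frames."""
    slots = init_slots
    outputs = []
    for t, feats in enumerate(frames):
        num_iters = full_iters if t == 0 else 1
        # Assumption: the previous frame's slots are reused directly as queries.
        slots = aggregate(slots, feats, num_iters)
        outputs.append(slots)
    return np.stack(outputs)                       # (num_frames, num_slots, dim)
```

For a video given as an array of shape `(num_frames, num_feats, dim)` and initial slots of shape `(num_slots, dim)`, `run_video` returns one slot set per frame. The value `full_iters=3` is an illustrative default, not a setting taken from the paper.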