STORM: Segment, Track, and Object Re-Localization from a Single Image

arXiv:2511.09771v3 Announce Type: replace Abstract: Accurate 6D pose estimation and tracking are core capabilities for physical AI systems, yet real-world deployment remains brittle and labor-intensive. Many pipelines rely on CAD models, manual masking, or per-object adaptation, and still fail under occlusion or fast motion without a principled way to recognize failure. We propose STORM, a unified framework for reference-conditioned 6D tracking that can operate from a single reference image, with minimal manual input and improved robustness. STORM combines: (i) Hierarchical Spatial Fusion Attention (HSFA), a task-driven reference-query fusion architecture that supports both single-reference and multi-reference conditioning and can optionally use vision-language semantic conditioning to resolve instance ambiguities; and (ii) a BCE-trained tracking verifier whose continuous compatibility logit is used as an energy-like score to detect drift and trigger automatic re-initialization. Experiments on LM-O and YCB-Video show that STORM improves annotation-free pose tracking accuracy over strong baselines and recovers reliably from severe occlusions and rapid viewpoint changes with minimal overhead.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top