FrameVGGT: Geometry-Aligned Frame-Level Memory for Bounded Streaming VGGT

arXiv:2603.07690v2 Announce Type: replace Abstract: Streaming Visual Geometry Transformers such as StreamVGGT enable strong online 3D perception, but their KV-cache grows unbounded over long streams, limiting practical deployment. We revisit bounded-memory streaming from the perspective of geometric support. Unlike language modeling, where useful information can often be compressed at the token level, geometry-driven reasoning depends on redundant and mutually compatible multi-view support. Under fixed budgets, token-level retention can fragment within-frame evidence, weaken the coherence of geometric support, and make stable long-horizon inference more difficult. Motivated by this observation, we propose FrameVGGT, a bounded explicit-memory framework that organizes each frame's incremental KV contribution as a coherent frame-level segment. FrameVGGT summarizes each segment with a lightweight key-space prototype and maintains a fixed-capacity memory of complementary segments, with an optional sparse anchor tier for difficult long-horizon intervals. Across long-sequence 3D reconstruction, video depth estimation, and camera pose estimation, FrameVGGT achieves favorable accuracy-memory trade-offs under bounded memory while maintaining more stable geometry over long streams.

Leave a Comment