Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors
arXiv:2604.14563v1 Announce Type: new
Abstract: Vision Transformer (ViT)-based sparse multi-view 3D object detectors have achieved remarkable accuracy but still suffer from high inference latency due to heavy token processing. To accelerate these mode…