FractalMamba++: Scaling Vision Mamba Across Resolutions via Hilbert Fractal Geometry
arXiv:2505.14062v3 Announce Type: replace
Abstract: Vision Mamba offers linear complexity for long visual sequences, yet its performance depends critically on how a two-dimensional patch grid is serialized into a one-dimensional state-space recurrence. Raster-style scans disrupt spatial continuity, and the mismatch between 2D locality and 1D state propagation becomes increasingly severe when the inference resolution grows beyond the training grid. This paper presents FractalMamba++, a resolution-scalable vision backbone organized around a single geometric principle: the recursive self-similar structure of the Hilbert curve determines how patches are serialized, where long-range state shortcuts are inserted, and how positional relations are encoded. First, Hilbert-curve-based Fractal Serialization preserves local 2D neighborhoods more faithfully than linear scans and provides consistent neighborhood statistics across resolutions. Second, the Fractal Hierarchy Skip Connection (FHSC) derives a compact set of deterministic state-injection routes from Hilbert recursion levels, mitigating long-sequence information fading without runtime search, hand-derived gradients, or dedicated CUDA kernels. Third, Fractal-Aware 2D Rotary Position Encoding (FA-RoPE) combines normalized 2D coordinates with a fractal hierarchy level so that feature interactions depend on actual spatial proximity and recursive structural role rather than serialized 1D distance. Extensive experiments on ImageNet-1K classification, COCO detection and instance segmentation, ADE20K semantic segmentation, and LEVIR-CD+ remote sensing change detection show that FractalMamba++ improves over existing Mamba-based vision backbones, especially under high-resolution inputs.
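The serialization idea at the core of the abstract can be illustrated with the classic bitwise index-to-coordinate mapping for the Hilbert curve. This is a minimal sketch of how a square patch grid might be ordered along a Hilbert traversal so that consecutive tokens stay spatially adjacent; the function names are illustrative and not taken from the paper's code, and the actual FractalMamba++ serialization may differ in details.

```python
def hilbert_d2xy(order, d):
    """Map a 1D index d to (x, y) on a 2**order x 2**order Hilbert curve.

    Classic iterative conversion: at each scale s, extract the quadrant
    bits (rx, ry) from d and apply the corresponding rotation/reflection.
    """
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:          # rotate the sub-square when needed
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x      # swap axes
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_serialize(n):
    """Order an n x n patch grid (n a power of two) along the Hilbert curve."""
    order = n.bit_length() - 1
    return [hilbert_d2xy(order, d) for d in range(n * n)]
```

A quick property check explains why this scan preserves 2D locality better than a raster scan: every pair of consecutive patches in the serialized sequence is at Manhattan distance 1 in the grid, at any resolution, whereas a raster scan jumps across a whole row width at each line break.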