Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization
arXiv:2605.10780v2 Announce Type: cross
Abstract: Representation autoencoders that reuse frozen pretrained vision encoders as visual tokenizers have achieved strong reconstruction and generation quality. However, existing methods universally extract f…