cs.CV, cs.LG, cs.RO

The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling

arXiv:2604.03191v1 Announce Type: cross
Abstract: Scaling Vision-Language-Action (VLA) models by upgrading the vision encoder is expected to improve downstream manipulation performance, as it does in vision-language modeling. We show that this expecta…