cs.CV, cs.RO

From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

arXiv:2605.04678v1 Announce Type: cross
Abstract: Latent actions serve as an intermediate representation that enables consistent modeling of vision-language-action (VLA) models across heterogeneous datasets. However, approaches to supervising VLAs wit…