LARY: A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment
arXiv:2604.11689v1 Announce Type: cross
Abstract: While the shortage of explicit action data limits Vision-Language-Action (VLA) models, human action videos offer a scalable yet unlabeled data source. A critical challenge in utilizing large-scale huma…