Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation
arXiv:2605.02757v1 Announce Type: cross
Abstract: Vision-language-action (VLA) models typically rely on large-scale real-world videos, whereas simulated data, despite being inexpensive and highly parallelizable to collect, often suffers from a substan…