Text-Guided 6D Object Pose Rearrangement via Closed-Loop VLM Agents
arXiv:2604.09781v1 Announce Type: new
Abstract: Vision-Language Models (VLMs) exhibit strong visual reasoning capabilities, yet they still struggle with 3D understanding. In particular, VLMs often fail to infer a text-consistent goal 6D pose of a targ…