cs.CV, cs.HC, cs.RO

A Multimodal Depth-Aware Method For Embodied Reference Understanding

arXiv:2510.08278v3 Announce Type: replace
Abstract: Embodied Reference Understanding requires identifying a target object in a visual scene based on both language instructions and pointing cues. While prior work has shown progress in open-vocabulary…