A Multimodal Depth-Aware Method For Embodied Reference Understanding
arXiv:2510.08278v3 Announce Type: replace
Abstract: Embodied Reference Understanding requires identifying a target object in a visual scene based on both language instructions and pointing cues. While prior works have shown progress in open-vocabulary…