Research on Vision-Language Question Answering Models for Industrial Robots
arXiv:2605.01483v1 Announce Type: cross
Abstract: A hierarchical cross-modal fusion model is proposed for vision-language question answering (VLQA) in industrial robotics, targeting the challenges of semantic ambiguity, complex environmental layouts, …