Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding
arXiv:2603.26211v1 Announce Type: new
Abstract: Autoregressive (AR) vision-language models (VLMs) have long dominated multimodal understanding, reasoning, and graphical user interface (GUI) grounding. Recently, discrete diffusion vision-language model…