cs.CV

What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs

arXiv:2605.12549v1 Announce Type: new
Abstract: Existing training-free approaches for GUI grounding often rely on multiple inference runs, such as iterative cropping or candidate aggregation, to identify target elements. Despite this additional comput…