From Pixels to BFS: High Maze Accuracy Does Not Imply Visual Planning
arXiv:2603.26839v2 Announce Type: replace-cross
Abstract: How do multimodal models solve visual spatial tasks — through genuine planning, or through brute-force search in token space? We introduce \textsc{MazeBench}, a benchmark of 110 procedurally g…