MMLANDMARKS: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding

arXiv:2512.17492v2 Announce Type: replace Abstract: Geo-spatial analysis of our world benefits from a multimodal approach, as every geographic location can be described in numerous ways (images from various viewpoints, textual descriptions, geographic coordinates, etc.). Current benchmarks have limited coverage across modalities, leading to specialized models that perform well in their respective domains but do not take full advantage of other geo-spatial modalities. We introduce the Multi-Modal Landmark dataset (MMLandmarks), a benchmark composed of four modalities: 197k high-resolution aerial images, 329k ground-view images, textual information, and geographic coordinates for 18,557 distinct landmarks in the United States. The MMLandmarks dataset has a one-to-one landmark-level correspondence across every modality, which enables training and benchmarking models for various geo-spatial tasks, including cross-view Ground-to-Satellite retrieval, ground and satellite geolocalization, Text-to-Image, and Text-to-GPS retrieval. We show that current specialized and off-the-shelf foundation models cannot be trivially used to solve this variety of geo-spatial tasks, illustrating a gap that multimodal datasets can help close by supporting broader geo-spatial understanding. We employ a simple CLIP-inspired baseline that shows versatility and broad generalization when trained with MMLandmarks.
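
To make the retrieval setting concrete, the following is a minimal sketch of how a landmark-level Recall@K could be computed for Ground-to-Satellite retrieval when every ground-view query and every aerial gallery image carries a landmark ID. The abstract does not specify the evaluation protocol, so the metric choice, the function names (`cosine`, `recall_at_k`), and the toy two-dimensional embeddings are all assumptions made for illustration, not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recall_at_k(query_embs, query_ids, gallery_embs, gallery_ids, k=1):
    """Fraction of queries whose k most similar gallery items
    include at least one item with the same landmark ID."""
    hits = 0
    for q, qid in zip(query_embs, query_ids):
        # Rank gallery indices by descending similarity to the query.
        ranked = sorted(
            range(len(gallery_embs)),
            key=lambda j: cosine(q, gallery_embs[j]),
            reverse=True,
        )
        top_ids = {gallery_ids[j] for j in ranked[:k]}
        hits += qid in top_ids
    return hits / len(query_embs)

# Hypothetical toy embeddings: three landmarks, one aerial image each.
aerial_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
aerial_ids = ["A", "B", "C"]

# Ground-view queries; the third belongs to landmark C but lies closer to A.
ground_embs = [[0.9, 0.1], [0.2, 0.8], [0.9, 0.2]]
ground_ids = ["A", "B", "C"]

r1 = recall_at_k(ground_embs, ground_ids, aerial_embs, aerial_ids, k=1)
r2 = recall_at_k(ground_embs, ground_ids, aerial_embs, aerial_ids, k=2)
print(f"Recall@1 = {r1:.3f}, Recall@2 = {r2:.3f}")
```

On this toy data, two of three queries retrieve the correct landmark at rank 1 and all three do within the top 2. In practice the embeddings would come from trained image encoders, and the same ID-matching logic would apply to the other cross-modal pairs the abstract lists, such as Text-to-Image.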
