EverydayMMQA: A Multilingual and Multimodal Framework for Culturally Grounded Spoken Visual QA
arXiv:2510.06371v2 Announce Type: replace
Abstract: Large-scale multimodal models achieve strong results on tasks such as Visual Question Answering (VQA), but they often struggle when queries require cultural context, visual grounding, or everyday knowledge, particularly in low-resource and underrepresented languages. We introduce OASIS, a large-scale, culturally grounded multimodal QA dataset spanning images, text, and speech. OASIS is built with EverydayMMQA, a scalable semi-automatic framework for creating localized spoken and visual QA resources, supported by multi-stage human-in-the-loop validation. OASIS contains approximately 0.92M real images and 14.8M QA pairs, including 3.7M spoken questions, with 383 hours of human-recorded speech and 20K hours of voice-cloned speech from 42 speakers. It supports four input settings: text-only, speech-only, text+image, and speech+image. The dataset focuses on English and on Arabic varieties across 18 countries, covering Modern Standard Arabic (MSA) as well as dialectal Arabic. It is designed to evaluate models beyond object recognition, targeting pragmatic, commonsense, and culturally grounded reasoning in real-world scenarios. We benchmark four closed-source models, three open-source models, and one fine-tuned model on OASIS. The framework and dataset will be made publicly available to the community. https://huggingface.co/datasets/QCRI/OASIS
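
Since the dataset is hosted on the Hugging Face Hub, a minimal loading sketch with the standard `datasets` library is shown below. This assumes the repository follows the usual Hub dataset layout; the configuration name, split name, and field names are hypothetical, so consult the dataset card at the URL above for the actual values.

```python
# Minimal sketch: loading OASIS from the Hugging Face Hub.
# NOTE: "default" config, "test" split, and the field names printed below are
# assumptions; check https://huggingface.co/datasets/QCRI/OASIS for the real ones.
from datasets import load_dataset

dataset = load_dataset("QCRI/OASIS", "default", split="test")

# Each record is a QA pair that may combine an image with a text or spoken
# question, depending on the input setting (text-only, speech-only,
# text+image, or speech+image).
example = dataset[0]
print(example.keys())
```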