WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering
arXiv:2604.05818v2 Announce Type: replace
Abstract: Multi-modal Retrieval-Augmented Generation (RAG) has emerged as a highly effective paradigm for Knowledge-Based Visual Question Answering (KB-VQA). Despite recent advancements, prevailing methods sti…