VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval
arXiv:2505.20291v5 Announce Type: replace-cross
Abstract: Text-to-image retrieval (T2I retrieval) remains challenging because cross-modal embeddings often behave as bags of concepts, underrepresenting structured visual relationships such as pose and v…
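The "bags of concepts" behavior described in the abstract can be illustrated with a deliberately extreme toy case. The sketch below is not the paper's method and does not model any real cross-modal encoder. It assumes a hypothetical `bag_of_concepts` representation that keeps only token counts, and shows that two captions with the same concepts but opposite relationships then become indistinguishable. All function names and example captions are illustrative assumptions.

```python
import math
from collections import Counter


def bag_of_concepts(text: str) -> Counter:
    """Toy order-insensitive representation: counts of lowercase tokens.

    Hypothetical stand-in for an embedding that ignores structure.
    """
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = "a dog chasing a cat"
swapped = "a cat chasing a dog"   # same concepts, opposite relationship
unrelated = "a bird on a fence"   # different concepts

# Word order is discarded, so the first two captions map to the same bag.
same_bag = bag_of_concepts(query) == bag_of_concepts(swapped)
sim_swapped = cosine(bag_of_concepts(query), bag_of_concepts(swapped))
sim_unrelated = cosine(bag_of_concepts(query), bag_of_concepts(unrelated))

print(same_bag, round(sim_swapped, 3), round(sim_unrelated, 3))
```

Under this assumption, a retriever scoring by `cosine` cannot rank an image of a dog chasing a cat above one of a cat chasing a dog, because the relationship between the concepts is never encoded. Learned embeddings are far less extreme, but the abstract's claim is that they lean in this direction.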