cs.CL, cs.CV, cs.LG

LaMI: Augmenting Large Language Models via Late Multi-Image Fusion

arXiv:2406.13621v2
Abstract: Commonsense reasoning often requires both textual and visual knowledge, yet Large Language Models (LLMs) trained solely on text lack visual grounding (e.g., “what color is an emperor penguin’s belly?…