cs.CV

Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification?

arXiv:2601.06993v2 Announce Type: replace
Abstract: Multi-modal large language models (MLLMs) exhibit strong general-purpose capabilities, yet they still struggle with Fine-Grained Visual Classification (FGVC), a core perception task that requires subtle vis…