cs.CV

Boosting Visual Instruction Tuning with Self-Supervised Guidance

arXiv:2604.12966v1 Announce Type: new
Abstract: Multimodal large language models (MLLMs) perform well on many vision-language tasks but often struggle with vision-centric problems that require fine-grained visual reasoning. Recent evidence suggests th…