CoVFT: Context-aware Visual Fine-tuning for Multimodal Large Language Models
arXiv:2603.21077v2 Announce Type: replace
Abstract: Multimodal large language models (MLLMs) have made remarkable progress in cross-modal perception and reasoning, yet a fundamental question remains unresolved: should the vision encoder be fine-tuned or…