MM-LIMA: Less Is More for Alignment in Multi-Modal Datasets
arXiv:2308.12067v3 Announce Type: replace-cross
Abstract: Multimodal large language models are typically trained in two stages: pre-training on image-text pairs, followed by fine-tuning on supervised vision-language instruction data. Recent stud…