cs.AI, cs.CL, cs.CV, cs.LG

MM-LIMA: Less Is More for Alignment in Multi-Modal Datasets

arXiv:2308.12067v3 Announce Type: replace-cross
Abstract: Multimodal large language models are typically trained in two stages: first, pre-training on image-text pairs, and then fine-tuning on supervised vision-language instruction data. Recent stud…