Dynamic Cluster Data Sampling for Efficient and Long-Tail-Aware Vision-Language Pre-training
arXiv:2604.27932v1 Announce Type: new
Abstract: The computational cost of training a vision-language model (VLM) can be reduced by sampling the training data. Previous work on efficient VLM pre-training has pointed to the importance of semantic data b…