Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation
arXiv:2604.04579v2
Abstract: Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in vision-language understanding, yet their high computational cost limits deployment in resource-constrai…