LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding
arXiv:2501.05067v3 Announce Type: replace
Abstract: In this paper, we introduce LLaVA-Octopus, a novel video multimodal large language model. LLaVA-Octopus adaptively weights features from different visual projectors based on user instructions, enabli…
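
As a rough illustration of instruction-conditioned weighting over several visual projectors, the following is a minimal sketch, not the paper's implementation. The abstract excerpt does not specify the fusion mechanism, so everything here is an assumption made for illustration: the softmax gate, the linear gate parameters (`gate_matrix`), the toy dimensions, and all function names.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gate_weights(instruction_embedding, gate_matrix):
    """Return one weight per projector from the instruction embedding.

    gate_matrix is a hypothetical learned parameter with one row per
    projector; each logit is the dot product of a row with the embedding.
    """
    logits = [
        sum(w * x for w, x in zip(row, instruction_embedding))
        for row in gate_matrix
    ]
    return softmax(logits)

def fuse(projector_features, weights):
    """Weighted sum of per-projector feature vectors of equal length."""
    dim = len(projector_features[0])
    return [
        sum(w * feats[i] for w, feats in zip(weights, projector_features))
        for i in range(dim)
    ]

# Toy inputs (assumed values): a 2-d instruction embedding and three
# projectors, each emitting a 3-d visual feature vector.
instruction_embedding = [1.0, 0.0]
gate_matrix = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
projector_features = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

weights = gate_weights(instruction_embedding, gate_matrix)
fused = fuse(projector_features, weights)
```

In this toy setup the instruction embedding aligns with the first gate row, so the first projector receives the largest weight and the other two receive equal, smaller weights. A different instruction embedding would shift the mix without changing the projectors themselves.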