cs.CV

Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE

arXiv:2605.02641v1 Announce Type: new
Abstract: We present Mamoda2.5, a unified AR-Diffusion framework that seamlessly integrates multimodal understanding and generation within a single architecture. To efficiently enhance the model’s generation capab…

cs.AI, cs.CV

Perceptual Flow Network for Visually Grounded Reasoning

arXiv:2605.02730v1 Announce Type: cross
Abstract: Despite the success of Large Vision-Language Models (LVLMs), general optimization objectives (e.g., standard MLE) fail to constrain visual trajectories, leading to language bias and hallucination. To m…
