cs.CV

Latent Denoising Improves Visual Alignment in Large Multimodal Models

arXiv:2604.21343v1 Announce Type: new
Abstract: Large Multimodal Models (LMMs) such as LLaVA are typically trained with an autoregressive language modeling objective, providing only indirect supervision to visual tokens. This often yields weak interna…