Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

/ April 3, 2026

arXiv:2601.10611v4 Announce Type: replace
Abstract: Today’s strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not di…

cs.CV

SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer

/ April 3, 2026

arXiv:2601.16515v2 Announce Type: replace
Abstract: Diffusion Transformers have demonstrated remarkable performance in video generation. However, their long input sequences incur substantial latency due to the quadratic complexity of full attention. V…

cs.CV

EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents

/ April 3, 2026

arXiv:2602.23205v2 Announce Type: replace
Abstract: Human behaviors in the real world naturally encode rich, long-term contextual information that can be leveraged to train embodied agents for perception, understanding, and acting. However, existing c…

cs.CV

OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning

/ April 3, 2026

arXiv:2603.24458v2 Announce Type: replace
Abstract: While proprietary systems such as Seedance-2.0 have achieved remarkable success in omni-capable video generation, open-source alternatives significantly lag behind. Most academic models remain heavil…

cs.AI, cs.IR

UniMixer: A Unified Architecture for Scaling Laws in Recommendation Systems

/ April 3, 2026

arXiv:2604.00590v2 Announce Type: cross
Abstract: In recent years, the scaling laws of recommendation models have attracted increasing attention, which govern the relationship between performance and parameters/FLOPs of recommenders. Currently, there …

cs.CV

Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision-Language Models

/ April 3, 2026

arXiv:2604.02048v1 Announce Type: new
Abstract: Developing vision-language models (VLMs) that generalize across diverse tasks requires large-scale training datasets with diverse content. In English, such datasets are typically constructed by aggregati…

cs.CV

True to Tone? Quantifying Skin Tone Fidelity and Bias in Photographic-to-Virtual Human Pipelines

/ April 3, 2026

arXiv:2604.02055v1 Announce Type: new
Abstract: Accurate reproduction of facial skin tone is essential for realism, identity preservation, and fairness in Virtual Human (VH) rendering. However, most accessible avatar creation pipelines rely on photogr…

cs.CV

Scaling Video Pretraining for Surgical Foundation Models

/ April 3, 2026

arXiv:2603.29966v2 Announce Type: replace
Abstract: Surgical video understanding is essential for computer-assisted interventions, yet existing surgical foundation models remain constrained by limited data scale, procedural diversity, and inconsistent…

cs.CV

PLUME: Latent Reasoning Based Universal Multimodal Embedding

/ April 3, 2026

arXiv:2604.02073v1 Announce Type: new
Abstract: Universal multimodal embedding (UME) maps heterogeneous inputs into a shared retrieval space with a single model. Recent approaches improve UME by generating explicit chain-of-thought (CoT) rationales be…

cs.AI, cs.CV, cs.SE

GPA: Learning GUI Process Automation from Demonstrations

/ April 3, 2026

arXiv:2604.01676v1 Announce Type: new
Abstract: GUI Process Automation (GPA) is a lightweight but general vision-based Robotic Process Automation (RPA), which enables fast and stable process replay with only a single demo. Addressing the fragility of …