- Provide.ai - Page 109

VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models

May 7, 2026

arXiv:2604.07634v2 Announce Type: replace
Abstract: Streaming vision-language models (VLMs) continuously generate responses given an instruction prompt and an online stream of input frames. This is a core mechanism for real-time visual assistants. Exi…

cs.CL, cs.CV

When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise

May 7, 2026

arXiv:2605.05045v1 Announce Type: new
Abstract: Vision-language models (VLMs) achieve strong multimodal performance but remain prone to relation hallucination, which requires accurate reasoning over inter-object interactions. We study the impact of vi…

cs.AI, cs.CV, cs.LG

Direct Product Flow Matching: Decoupling Radial and Angular Dynamics for Few-Shot Adaptation

May 7, 2026

arXiv:2605.05054v1 Announce Type: new
Abstract: Recent flow matching (FM) methods improve the few-shot adaptation of vision-language models by modeling cross-modal alignment as a continuous multi-step flow. In this paper, we argue that existing FM me…

cs.CV

X-Cache: Cross-Chunk Block Caching for Few-Step Autoregressive World Models Inference

May 7, 2026

arXiv:2604.20289v2 Announce Type: replace
Abstract: Real-time world simulation is becoming a key infrastructure for scalable evaluation and online reinforcement learning of autonomous driving systems. Recent driving world models built on autoregressiv…

cs.CV

GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

May 7, 2026

arXiv:2604.26752v2 Announce Type: replace
Abstract: We present GLM-5V-Turbo, a step toward native foundation models for multimodal agents. As foundation models are increasingly deployed in real environments, agentic capability depends not only on lang…

cs.CV

A unified Benchmark for Multi-Frame Image Restoration under Severe Refractive Warping

May 7, 2026

arXiv:2605.05079v1 Announce Type: new
Abstract: Video sequences captured through dynamic refractive media, such as turbulent air or a water surface, often suffer from severe geometric distortions and temporal instability. While recent advances address…

cs.AI, cs.CV, cs.LG

What Matters in Practical Learned Image Compression

May 7, 2026

arXiv:2605.05148v1 Announce Type: new
Abstract: One of the major differentiators unlocked by learned codecs relative to their hard-coded traditional counterparts is their ability to be optimized directly to appeal to the human visual system. Despite t…

cs.AI, cs.CV

Aes3D: Aesthetic Assessment in 3D Gaussian Splatting

May 7, 2026

arXiv:2605.05155v1 Announce Type: new
Abstract: As 3D Gaussian Splatting (3DGS) gains attention in immersive media and digital content creation, assessing the aesthetics of 3D scenes becomes important in helping creators build more visually compelling…

cs.AI, cs.CV

StableI2I: Spotting Unintended Changes in Image-to-Image Transition

May 7, 2026

arXiv:2605.04453v1 Announce Type: new
Abstract: In most real-world image-to-image (I2I) scenarios, existing evaluations primarily focus on instruction following and the perceptual quality or aesthetics of the generated images. However, they largely fa…

cs.CV

PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World

May 7, 2026

arXiv:2605.05163v1 Announce Type: new
Abstract: Synthesizing physics-grounded 3D assets is a critical bottleneck for interactive virtual worlds and embodied AI. Existing methods predominantly focus on static geometry, overlooking the functional proper…