AdaTooler-V: Adaptive Tool-Use for Images and Videos

/ April 29, 2026

arXiv:2512.16918v3 Announce Type: replace
Abstract: Recent advances have shown that multimodal large language models (MLLMs) benefit from multimodal interleaved chain-of-thought (CoT) with vision tool interactions. However, existing open-source models…

cs.RO

RISE: Self-Improving Robot Policy with Compositional World Model

/ April 29, 2026

arXiv:2602.11075v2 Announce Type: replace
Abstract: Despite the sustained scaling on model capacity and data acquisition, Vision-Language-Action (VLA) models remain brittle in contact-rich and dynamic manipulation tasks, where minor execution deviatio…

cs.CV

OmniSch: A Multimodal PCB Schematic Benchmark For Structured Diagram Visual Reasoning

/ April 29, 2026

arXiv:2604.00270v3 Announce Type: replace
Abstract: Recent large multimodal models (LMMs) have made rapid progress in visual grounding, document understanding, and diagram reasoning tasks. However, their ability to convert Printed Circuit Board (PCB) …

cs.RO

KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning

/ April 29, 2026

arXiv:2604.25788v1 Announce Type: new
Abstract: Robotic systems that interact with the physical world must reason about kinematic and dynamic constraints imposed by their own embodiment, their environment, and the task at hand. We introduce KinDER, a …

cs.CV, cs.RO

BEVal: A Cross-dataset Evaluation Study of BEV Segmentation Models for Autonomous Driving

/ April 29, 2026

arXiv:2408.16322v4 Announce Type: replace-cross
Abstract: Current research in semantic bird’s-eye view segmentation for autonomous driving focuses solely on optimizing neural network models using a single dataset, typically nuScenes. This practice lea…

cs.AI, cs.CV

Latent Anomaly Knowledge Excavation: Unveiling Sparse Sensitive Neurons in Vision-Language Models

/ April 29, 2026

arXiv:2604.07802v3 Announce Type: replace
Abstract: Large-scale vision-language models (VLMs) exhibit remarkable zero-shot capabilities, yet the internal mechanisms driving their anomaly detection (AD) performance remain poorly understood. Current met…

cs.CV, cs.RO

InternScenes: A Large-scale Simulatable Indoor Scene Dataset with Realistic Layouts

/ April 29, 2026

arXiv:2509.10813v4 Announce Type: replace-cross
Abstract: The advancement of Embodied AI heavily relies on large-scale, simulatable 3D scene datasets characterized by scene diversity and realistic layouts. However, existing datasets typically suffer f…

cs.RO, eess.AS

ASAP: An Azimuth-Priority Strip-Based Search Approach to Planar Microphone Array DOA Estimation in 3D

/ April 29, 2026

arXiv:2604.25387v1 Announce Type: cross
Abstract: Direction-of-arrival (DOA) estimation is an important task in microphone array processing and many downstream applications. The steered response power with phase transform (SRP-PHAT) method has been wi…

cs.CV

NTIRE 2026 Rip Current Detection and Segmentation (RipDetSeg) Challenge Report

/ April 29, 2026

arXiv:2604.17070v2 Announce Type: replace
Abstract: This report presents the NTIRE 2026 Rip Current Detection and Segmentation (RipDetSeg) Challenge, which targets automatic rip current understanding in images. Rip currents are hazardous nearshore flo…

cs.AI, cs.CV, cs.RO

From Scene to Object: Text-Guided Dual-Gaze Prediction

/ April 29, 2026

arXiv:2604.20191v2 Announce Type: replace-cross
Abstract: Interpretable driver attention prediction is crucial for human-like autonomous driving. However, existing datasets provide only scene-level global gaze rather than fine-grained object-level ann…