ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs

April 14, 2026

arXiv:2604.03765v2 Announce Type: replace
Abstract: Recent advances in multimodal large language models (MLLMs) have greatly improved image understanding and captioning capabilities. However, existing image captioning benchmarks typically suffer from …

cs.CG, cs.CV

Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

April 14, 2026

arXiv:2604.11331v1 Announce Type: new
Abstract: 3D scene generation has long been dominated by 2D multi-view or video diffusion models. This is due not only to the lack of scene-level 3D latent representation, but also to the fact that most scene-leve…

cs.CV

LoGo-MR: Screening Breast MRI for Cancer Risk Prediction by Efficient Omni-Slice Modeling

April 14, 2026

arXiv:2604.11348v1 Announce Type: new
Abstract: Efficient and explainable breast cancer (BC) risk prediction is critical for large-scale population-based screening. Breast MRI provides functional information for personalized risk assessment. Yet effec…

cs.CV

INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

April 14, 2026

arXiv:2604.07209v2 Announce Type: replace
Abstract: Building world models with spatial consistency and real-time interactivity remains a fundamental challenge in computer vision. Current video generation paradigms often struggle with a lack of spatial…

cs.CV

RemoteAgent: Bridging Vague Human Intents and Earth Observation with RL-based Agentic MLLMs

April 14, 2026

arXiv:2604.07765v2 Announce Type: replace
Abstract: Earth Observation (EO) systems are essentially designed to support domain experts who often express their requirements through vague natural language rather than precise, machine-friendly instruction…

cs.CV

ParseBench: A Document Parsing Benchmark for AI Agents

April 14, 2026

arXiv:2604.08538v3 Announce Type: replace
Abstract: AI agents are changing the requirements for document parsing. What matters is semantic correctness: parsed output must preserve the structure and meaning needed for autonomous decisions, including co…

cs.CV

WBCBench 2026: A Challenge for Robust White Blood Cell Classification Under Class Imbalance

April 14, 2026

arXiv:2604.10797v1 Announce Type: new
Abstract: We present WBCBench 2026, an ISBI challenge and benchmark for automated WBC classification designed to stress-test algorithms under three key difficulties: (i) severe class imbalance across 13 morphologi…

cs.CV

ConvFormer3D-TAP: Phase/Uncertainty-Aware Front-End Fusion for Cine CMR View Classification Pipelines

April 14, 2026

arXiv:2604.11389v1 Announce Type: new
Abstract: Reliable recognition of standard cine cardiac MRI views is essential because each view determines which cardiac anatomy is visualized and which quantitative analyses can be performed. Incorrect view iden…

cs.CV

M$^{2}$SNet: Multi-scale in Multi-scale Subtraction Network for Medical Image Segmentation

April 14, 2026

arXiv:2303.10894v3 Announce Type: replace
Abstract: Accurate medical image segmentation is critical for early medical diagnosis. Most existing methods are based on U-shape structure and use element-wise addition or concatenation to fuse different leve…

cs.CV

What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction

April 14, 2026

arXiv:2407.08101v4 Announce Type: replace
Abstract: Vision-language models have shown impressive progress in recent years. However, existing models are largely limited to turn-based interactions, where each turn must be stepped (i.e., prompted) by the…