
Seek-and-Solve: Benchmarking MLLMs for Visual Clue-Driven Reasoning in Daily Scenarios

/ April 16, 2026

arXiv:2604.14041v1 Announce Type: new
Abstract: Daily scenarios are characterized by visual richness, requiring Multimodal Large Language Models (MLLMs) to filter noise and identify decisive visual clues for accurate reasoning. Yet, current benchmarks…

cs.CV

What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction

/ April 16, 2026

arXiv:2407.08101v5 Announce Type: replace
Abstract: Vision-language models have shown impressive progress in recent years. However, existing models are largely limited to turn-based interactions, where each turn must be stepped (i.e., prompted) by the…

cs.AI, cs.CV

Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision

/ April 16, 2026

arXiv:2604.13304v1 Announce Type: cross
Abstract: Understanding the internal activations of Vision Transformers (ViTs) is critical for building interpretable and trustworthy models. While Sparse Autoencoders (SAEs) have been used to extract human-inte…

cs.CV

Bias at the End of the Score

/ April 16, 2026

arXiv:2604.13305v1 Announce Type: new
Abstract: Reward models (RMs) are inherently non-neutral value functions designed and trained to encode specific objectives, such as human preferences or text-image alignment. RMs have become crucial components of…

cs.AI, cs.CL

From Prediction to Justification: Aligning Sentiment Reasoning with Human Rationale via Reinforcement Learning

/ April 16, 2026

arXiv:2604.13398v1 Announce Type: cross
Abstract: While Aspect-based Sentiment Analysis (ABSA) systems have achieved high accuracy in identifying sentiment polarities, they often operate as “black boxes,” lacking the explicit reasoning capabilities ch…

cs.CL, cs.CV

When ‘YES’ Meets ‘BUT’: Can Large Models Comprehend Contradictory Humor Through Comparative Reasoning?

/ April 16, 2026

arXiv:2503.23137v2 Announce Type: replace-cross
Abstract: Understanding humor, particularly when it involves complex, contradictory narratives that require comparative reasoning, remains a significant challenge for large vision-language models (VLMs). T…

cs.CV

Right Regions, Wrong Labels: Semantic Label Flips in Segmentation under Correlation Shift

/ April 16, 2026

arXiv:2604.13326v1 Announce Type: new
Abstract: The robustness of machine learning models can be compromised by spurious correlations between non-causal features in the input data and target labels. A common way to test for such correlations is to tra…

cs.AI, cs.CL, cs.CV

MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments

/ April 16, 2026

arXiv:2604.13418v1 Announce Type: cross
Abstract: Motivated by the underspecified, multi-hop nature of search queries and the multimodal, heterogeneous, and often conflicting nature of real-world web results, we introduce MERRIN (Multimodal Evidence R…

cs.AI, cs.CV

FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks

/ April 16, 2026

arXiv:2505.19662v3 Announce Type: replace
Abstract: This paper introduces FieldWorkArena, a benchmark for agentic AI targeting real-world field work. With the recent increase in demand for agentic AI, they are built to detect and document safety hazar…

cs.CL

CANVAS: Continuity-Aware Narratives via Visual Agentic Storyboarding

/ April 16, 2026

arXiv:2604.13452v1 Announce Type: new
Abstract: Long-form visual storytelling requires maintaining continuity across shots, including consistent characters, stable environments, and smooth scene transitions. While existing generative models can produc…