
NSMQ Riddles: A Benchmark of Scientific and Mathematical Riddles for Quizzing Large Language Models

/ May 11, 2026

arXiv:2605.07051v1 Announce Type: new
Abstract: Large Language Models (LLMs) have shown good performance on various science educational benchmarks, demonstrating their potential for use in science and mathematics education. Yet, LLMs tend to be evalua…

cs.CV

MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence

/ May 11, 2026

arXiv:2605.07919v1 Announce Type: new
Abstract: Medical vision–language models (VLMs) are usually evaluated on intact image–question pairs, but trustworthy clinical use requires a stronger property: a model must recognise when the evidential basis f…

cs.AI, cs.IR

LARAG: Link-Aware Retrieval Strategy for RAG Systems in Hyperlinked Technical Documentation

/ May 11, 2026

arXiv:2605.07517v1 Announce Type: cross
Abstract: Retrieval-Augmented Generation (RAG) enhances the factual grounding of Large Language Models by conditioning their outputs on external documents. However, standard embedding-based retrievers treat natu…

cs.CV, cs.RO

123D: Unifying Multi-Modal Autonomous Driving Data at Scale

/ May 11, 2026

arXiv:2605.08084v1 Announce Type: new
Abstract: The pursuit of autonomous driving has produced one of the richest sensor data collections in all of robotics. However, its scale and diversity remain largely untapped. Each dataset adopts different 2D an…

cs.AI, cs.CL

GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations

/ May 11, 2026

arXiv:2605.07053v1 Announce Type: new
Abstract: Benchmarks like GSM8K are popular measures of mathematical reasoning, but leaderboard gains can overstate true capability due to memorization of fixed test sets. Most robustness variants apply surface-le…

cs.LG

PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents

/ May 11, 2026

arXiv:2605.07039v1 Announce Type: new
Abstract: Large language models have become drivers of evolutionary search, but most systems rely on a fixed, prompt-elicited policy to sample next candidates. This limits adaptation in practical engineering and r…

cs.AI, cs.LG

EviDep: Trustworthy Multimodal Depression Estimation via Disentangled Evidential Learning

/ May 11, 2026

arXiv:2604.16579v2 Announce Type: replace-cross
Abstract: Automated multimodal depression estimation in unconstrained environments is inherently challenged by naturalistic noise and complex behavioral variability. Prevailing deterministic methods, how…

cs.CV, cs.RO

3D Generation for Embodied AI and Robotic Simulation: A Survey

/ May 11, 2026

arXiv:2604.26509v3 Announce Type: replace
Abstract: Embodied AI and robotic systems increasingly depend on scalable, diverse, and physically grounded 3D content for simulation-based training and real-world deployment. While 3D generative modeling has …

cs.AI, physics.med-ph

Overcoming data scarcity through multi-center federated learning for organs-at-risk segmentation in pediatric upper abdominal radiotherapy

/ May 11, 2026

arXiv:2605.06820v1 Announce Type: cross
Abstract: Deep learning-based organs/structures-at-risk (OARs) auto-contouring models can improve radiotherapy workflows, but models trained on adult data often underperform in pediatric patients. Developing robu…

cs.AI, cs.LG

Exact Is Easier: Credit Assignment for Cooperative LLM Agents

/ May 11, 2026

arXiv:2603.06859v2 Announce Type: replace-cross
Abstract: Removing an agent from a cooperative team to measure its contribution seems natural, yet in multi-agent LLM systems this evaluation distorts the result it claims to measure. This failure is not…