Provide.ai

BenchMarker: An Education-Inspired Toolkit for Highlighting Flaws in Multiple-Choice Benchmarks

April 21, 2026

arXiv:2602.06221v2 Announce Type: replace
Abstract: Multiple-choice question answering (MCQA) is standard in NLP, but benchmarks lack rigorous quality control. We present BenchMarker, an education-inspired toolkit using LLM judges to flag three common…

cs.CV, cs.LG

Vision Language Models are Biased

April 21, 2026

arXiv:2505.23941v4 Announce Type: replace
Abstract: Large language models (LLMs) memorize a vast amount of prior knowledge from the Internet that helps them on downstream tasks but also may notoriously sway their outputs towards wrong or biased answer…

cs.AI, cs.RO

Sensorimotor Self-Recognition in Multimodal Large Language Model-Driven Robots

April 21, 2026

arXiv:2505.19237v2 Announce Type: replace-cross
Abstract: Self-recognition — the ability to maintain an internal representation of one’s own body within the environment — underpins intelligent, autonomous behavior. As a foundational component of the…

cs.CL

CBRS: Cognitive Blood Request System with Bilingual Dataset and Dual-Layer Filtering for Multi-Platform Social Streams

April 21, 2026

arXiv:2604.16665v1 Announce Type: new
Abstract: Urgent blood donation seeking posts and messages on social media often go unnoticed due to the overwhelming volume of daily communications. Traditional app-based systems, reliant on manual input, struggl…

cs.AI, cs.CL, cs.DB

PersonalHomeBench: Evaluating Agents in Personalized Smart Homes

April 21, 2026

arXiv:2604.16813v1 Announce Type: cross
Abstract: Agentic AI systems are rapidly advancing toward real-world applications, yet their readiness in complex and personalized environments remains insufficiently characterized. To address this gap, we intro…

cs.AI, cs.CL

CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credit

April 21, 2026

arXiv:2510.06133v2 Announce Type: replace
Abstract: Diffusion large language models (dLLMs) generate text through iterative denoising. In commonly adopted parallel decoding schemes, each step confirms only high-confidence positions while remasking the…

cs.AI, cs.CL

WeatherArchive-Bench: Benchmarking Retrieval-Augmented Reasoning for Historical Weather Archives

April 21, 2026

arXiv:2510.05336v2 Announce Type: replace
Abstract: Historical archives on weather events are collections of enduring primary source records that offer rich, untapped narratives of how societies have experienced and responded to extreme weather events…

cs.CL, cs.HC

OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation

April 21, 2026

arXiv:2506.05606v5 Announce Type: replace
Abstract: Can large language models (LLMs) accurately simulate the next web action of a specific user? While LLMs have shown promising capabilities in generating “believable” human behaviors, evaluating thei…

cs.AI, cs.LG, q-bio.QM

A Systematic Survey and Benchmark of Deep Learning for Molecular Property Prediction in the Foundation Model Era

April 21, 2026

arXiv:2604.16586v1 Announce Type: new
Abstract: Molecular property prediction integrates quantum chemistry, cheminformatics, and deep learning to connect molecular structure with physicochemical and biological behavior. This survey traces four complem…

cs.CV, cs.HC

Deep Learning for Virtual Reality User Identification: A Benchmark

April 21, 2026

arXiv:2604.16341v1 Announce Type: cross
Abstract: Virtual Reality (VR) applications require robust user identification systems to ensure secure access to equipment and protect worker identities. Motion tracking data from VR headsets and controllers ha…