Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks

/ May 5, 2026

arXiv:2605.01417v1 Announce Type: cross
Abstract: Evaluating large language models (LLMs) for medical applications remains challenging due to benchmark saturation, limited data accessibility, and insufficient coverage of relevant tasks. Existing suite…

cs.CL

Auditing demographic bias in AI-based emergency police dispatch: a cross-lingual evaluation of eleven large language models

/ May 5, 2026

arXiv:2605.01451v1 Announce Type: new
Abstract: Large language models (LLMs) are rapidly being integrated into high-stakes public safety systems, including emergency call triage and dispatch decision support, yet their demographic fairness in this con…

cs.AI, cs.CL

Breaking the Silence: A Dataset and Benchmark for Bangla Text-to-Gloss Translation

/ May 5, 2026

arXiv:2504.02293v3 Announce Type: replace-cross
Abstract: Gloss is a written approximation that bridges Sign Language (SL) and its corresponding spoken language. Despite a deaf and hard-of-hearing population of at least 3 million in Bangladesh, Bangla…

cs.CL

The grip of grammar on meaning uncertainty: cross-linguistic evidence, neural correlates, and clinical relevance

/ May 5, 2026

arXiv:2605.01537v1 Announce Type: new
Abstract: Isolated word meanings are inherently uncertain. This uncertainty reduces when they are combined and anchored in context. We propose that grammar compresses meaning uncertainty cross-linguistically, whic…

cs.AI, cs.CL

Where Do Prompt Perturbations Break Generation? A Segment-Level View of Robustness in LoRA-Tuned Language Models

/ May 5, 2026

arXiv:2605.01605v1 Announce Type: cross
Abstract: Large language models are sensitive to minor prompt perturbations, yet existing robustness methods usually enforce consistency at the whole-sequence level. This holistic view can hide an important fail…

cs.AI, cs.CL

Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese

/ May 5, 2026

arXiv:2605.01630v1 Announce Type: cross
Abstract: Rankings produced by holistic LLM-as-a-judge scoring are sensitive to the bias of the chosen judge model. We show that switching to binary rubric scoring with multi-judge filtering removes this sensiti…

cs.AI, cs.CL

OpenAI GPT-5 System Card

/ May 5, 2026

arXiv:2601.03267v2 Announce Type: replace-cross
Abstract: This is the system card published alongside the OpenAI GPT-5 launch, August 2025. GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model f…

cs.AI, cs.CL, cs.IR

When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering

/ May 5, 2026

arXiv:2601.19827v3 Announce Type: replace-cross
Abstract: Retrieval-Augmented Generation (RAG) extends large language models (LLMs) beyond parametric knowledge, yet it is unclear when iterative retrieval-reasoning loops meaningfully outperform static …

cs.CL

Only Say What You Know: Calibration-Aware Generation for Long-Form Factuality

/ May 5, 2026

arXiv:2605.01749v1 Announce Type: new
Abstract: Large Reasoning Models achieve strong performance on complex tasks but remain prone to hallucinations, particularly in long-form generation where errors compound across reasoning steps. Existing approach…

cs.AI, cs.CL

Can AI Be a Good Peer Reviewer? A Survey of Peer Review Process, Evaluation, and the Future

/ May 5, 2026

arXiv:2604.27924v2 Announce Type: replace-cross
Abstract: Peer review is a multi-stage process involving reviews, rebuttals, meta-reviews, final decisions, and subsequent manuscript revisions. Recent advances in large language models (LLMs) have motiv…