cs.AI, cs.CL, cs.LG

Interactive Benchmarks

arXiv:2603.04737v3 Announce Type: replace
Abstract: Existing reasoning evaluation paradigms suffer from different limitations: fixed benchmarks are increasingly saturated and vulnerable to contamination, while preference-based evaluations rely on subj…

cs.AI

MMSkills: Towards Multimodal Skills for General Visual Agents

arXiv:2605.13527v1 Announce Type: new
Abstract: Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior primarily as textual prompts, executable code, or learned routines…

cs.AI

How to Interpret Agent Behavior

arXiv:2605.13625v1 Announce Type: new
Abstract: Autonomous agents such as Claude Code and Codex now operate for hours or even days. Understanding their runtime behavior has become critical for downstream tasks such as diagnosing inefficiencies, fixing…
