Capture the Flags: Family-Based Evaluation of Agentic LLMs via Semantics-Preserving Transformations
arXiv:2602.05523v2 Announce Type: replace-cross
Abstract: Agentic large language models (LLMs) are increasingly evaluated on cybersecurity tasks using capture-the-flag (CTF) benchmarks, yet existing pointwise benchmarks offer limited insight into agen…