cs.AI, cs.CL, cs.CR

Systematic Capability Benchmarking of Frontier Large Language Models for Offensive Cyber Tasks

arXiv:2604.17159v1 Announce Type: cross
Abstract: We present what is, to our knowledge, the most comprehensive cross-model evaluation of LLM agents on offensive cybersecurity tasks, benchmarking 10 frontier models from 7 providers on all 200 challenges of the…