cs.AI, cs.SE

GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers

arXiv:2604.02648v1 Announce Type: cross
Abstract: The autonomous discovery of bugs remains a significant challenge in modern software development. Compared to code generation, the complexity of dynamic runtime environments makes bug discovery consider…