HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?
arXiv:2604.09408v3
Abstract: Frontier coding agents solve complex tasks when given complete context but collapse when specifications are incomplete or ambiguous. The bottleneck is not raw capability, but judgment: knowing when t…