Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks
arXiv:2604.24964v1 Announce Type: cross
Abstract: Existing web agent benchmarks have largely converged on short, single-site tasks on which frontier models are approaching saturation. However, real-world web use consists of long-horizon, multi-site wor…