ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces
arXiv:2604.05172v2 Announce Type: replace
Abstract: Large language model (LLM) agents are increasingly deployed to automate productivity tasks (e.g., email, scheduling, document management), but evaluating them on live services is risky due to potenti…