SaaS-Bench is a new benchmark built on 23 deployable SaaS systems across six professional domains, containing 106 long-horizon tasks for evaluating computer-using agents. Experiments show that even the strongest models complete fewer than 4% of tasks end-to-end, highlighting significant limitations in current agent capabilities.