This paper introduces OfficeEval, a benchmark built on China's National Computer Rank Examination (NCRE) that evaluates LLM agents on complex Office automation tasks. Frontier models reach at most 36.6% in the single-turn setting and 68.8% when run as agentic systems, both far below human-level performance.