AGORA is a new benchmark for evaluating large language models on archive-grounded reasoning tasks across workplace documents, comprising 362 questions over 9,664 real documents. The strongest model achieves only 59.4% accuracy, highlighting substantial room for improvement.