LoHoSearch is a new benchmark for evaluating long-horizon search agents, built from a knowledge graph of 7 million Wikipedia entities. Its questions combine large search spaces with structural complexity, pushing difficulty beyond the ceiling of human-authored questions. Even the best model achieves only 34.74% accuracy.