difficulty-ceiling

LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling

arXiv cs.CL · yesterday

LoHoSearch is a new benchmark for evaluating long-horizon search agents, built from a knowledge graph of 7 million Wikipedia entities. Its questions combine large search spaces with structural complexity so that they exceed the difficulty ceiling of human-authored questions, and the best model achieves only 34.74% accuracy.
