difficulty-ceiling

LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling

arXiv cs.CL · yesterday

LoHoSearch is a new benchmark for evaluating long-horizon search agents, built from a knowledge graph of 7 million Wikipedia entities. Its questions have large search spaces and complex structure, which pushes them past the difficulty ceiling of human-authored questions. The best-performing model reaches only 34.74% accuracy.
