This paper presents QuestBench, a student-built benchmark for evaluating deep research systems in the humanities and social sciences. Even advanced systems such as GPT-5.5 pass only 57.58% of the questions, which highlights persistent failures in trustworthiness.