This paper introduces SciConBench, a large-scale benchmark with 9.11K questions and expert-written conclusions for evaluating AI agents' ability to synthesize scientific conclusions from open-domain evidence. The study finds that even the best agent achieves only a factual F1 of 0.337 in clean-room settings, highlighting that reliable synthesis remains an open challenge.
This paper evaluates whether bibliometric structure improves LLM-assisted scientific literature synthesis by comparing six pipelines for generating cluster descriptions. Results show LLMs perform best in a hybrid workflow where bibliometric algorithms define clusters and LLMs generate readable descriptions.