This paper introduces NSMQ Riddles, a benchmark of scientific and mathematical riddles from Ghana's National Science and Maths Quiz for evaluating large language models, addressing the underrepresentation of Global South datasets in AI research.
A study of 25,000 AI-scientist agent trials finds that the agents ignore evidence 68% of the time and rarely revise their hypotheses, suggesting that popular scaffolding fixes execute workflows without instilling genuine scientific reasoning.
COMPOSITE-STEM introduces a benchmark of 70 expert-curated agentic tasks across physics, biology, chemistry, and mathematics, designed to evaluate AI agents on scientific workflows that existing, saturated benchmarks do not cover. The top-performing model (Claude Opus 4.6) scores only 21.4%, revealing significant capability gaps in scientific reasoning.
MedConclusion introduces a large-scale benchmark of 5.7 million PubMed structured abstracts for evaluating LLMs on biomedical conclusion generation from structured scientific evidence. The study finds that conclusion writing is behaviorally distinct from summarization and that current automatic metrics cluster strong models too closely to distinguish them.
DeepMind reports that Gemini Deep Think can solve professional research problems in mathematics, physics, and computer science, and introduces a new agent, 'Aletheia', that iteratively verifies and revises its solutions.
OpenAI introduces FrontierScience, a new benchmark for measuring expert-level AI scientific capabilities across physics, chemistry, and biology, with GPT-5.2 achieving 77% on olympiad-style tasks and 25% on research-style tasks. The paper presents early evidence that GPT-5 meaningfully accelerates real scientific workflows, shortening work from weeks to hours, and establishes metrics for tracking progress toward AI-accelerated science.
OpenAI demonstrates GPT-5's capability to accelerate biological research by autonomously optimizing a molecular cloning protocol in collaboration with Red Queen Bio, achieving a 79-fold improvement in cloning efficiency through novel enzymatic mechanisms. The work showcases AI's potential to support experimental iteration and empirical validation in wet lab settings while highlighting biosecurity considerations.
OpenAI releases GPT-5.2, featuring GPT-5.2 Pro and GPT-5.2 Thinking variants optimized for scientific and mathematical work. The models achieve state-of-the-art performance on benchmarks such as GPQA Diamond (93.2%) and FrontierMath (40.3%), reflecting improved reasoning capabilities aimed at accelerating scientific research across physics, chemistry, biology, and mathematics.