scientific-reasoning

#scientific-reasoning

NSMQ Riddles: A Benchmark of Scientific and Mathematical Riddles for Quizzing Large Language Models

arXiv cs.CL · 2d ago

This paper introduces NSMQ Riddles, a novel benchmark using scientific and mathematical riddles from Ghana's National Science and Maths Quiz to evaluate Large Language Models, addressing the underrepresentation of Global South datasets in AI research.

AI scientists produce results without reasoning scientifically [R]

Reddit r/MachineLearning · 2026-04-22

A study of 25,000 AI scientist trials finds the agents ignore evidence 68% of the time and rarely revise hypotheses, showing popular scaffolding fixes don’t instill true scientific reasoning.

COMPOSITE-STEM

arXiv cs.CL · 2026-04-20

COMPOSITE-STEM introduces a benchmark of 70 expert-curated agentic tasks across physics, biology, chemistry, and mathematics, designed to evaluate AI agents on scientific workflows beyond saturated benchmarks. The top-performing model (Claude Opus 4.6) achieves only 21.4%, demonstrating significant capability gaps in scientific reasoning.

AI scientists produce results without reasoning scientifically

Hugging Face Daily Papers · 2026-04-20

Large-scale study finds LLM-based scientific agents ignore evidence 68% of the time and rarely revise beliefs, showing they execute workflows but lack genuine scientific reasoning.

MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts

Hugging Face Daily Papers · 2026-04-07

MedConclusion introduces a large-scale benchmark of 5.7 million PubMed structured abstracts for evaluating LLMs on generating biomedical conclusions from structured scientific evidence. The study finds that conclusion writing is behaviorally distinct from summarization and that current automatic metrics place strong models in a tight cluster.

Accelerating Mathematical and Scientific Discovery with Gemini Deep Think

Google DeepMind Blog · 2026-02-09

DeepMind announces Gemini Deep Think's ability to solve professional research problems in mathematics, physics, and computer science, highlighted by a new agent 'Aletheia' that iteratively verifies and revises solutions.

Evaluating AI’s ability to perform scientific research tasks

OpenAI Blog · 2025-12-16

OpenAI introduces FrontierScience, a new benchmark for measuring expert-level AI scientific capabilities across physics, chemistry, and biology, with GPT-5.2 achieving 77% on olympiad-style tasks and 25% on research-style tasks. The paper presents early evidence that GPT-5 meaningfully accelerates real scientific workflows, shortening work from weeks to hours while establishing metrics for tracking progress toward AI-accelerated science.

Measuring AI’s capability to accelerate biological research

OpenAI Blog · 2025-12-16

OpenAI demonstrates GPT-5's capability to accelerate biological research by autonomously optimizing a molecular cloning protocol in collaboration with Red Queen Bio, achieving a 79-fold improvement in cloning efficiency through novel enzymatic mechanisms. The work showcases AI's potential to support experimental iteration and empirical validation in wet lab settings while highlighting biosecurity considerations.

Advancing science and math with GPT-5.2

OpenAI Blog · 2025-12-11

OpenAI releases GPT-5.2, featuring GPT-5.2 Pro and GPT-5.2 Thinking variants optimized for scientific and mathematical work. The models achieve state-of-the-art performance on benchmarks like GPQA Diamond (93.2%) and FrontierMath (40.3%), demonstrating improved reasoning capabilities designed to accelerate scientific research across physics, chemistry, biology, and mathematics.
