Tag
Superpowers 6 is released, using Fable 5 to run 25 autonomous experiments, improving build speed by 50% and reducing token costs by 60%, with detailed records of the experimental process and lessons from failures.
This paper presents TrafficSci, an agentic AI system that automates the discovery of universal traffic laws across cities through iterative workflows, successfully rediscovering established laws and identifying a new temporal memory scale in urban driving behavior.
A tweet speculates that OpenAI's upcoming AI research intern (September) feels like early AGI, and predicts a fully autonomous AI researcher by 2027-2028, which could be the first ASI.
A demonstration of building a full LLM from a single prompt to an AI coding agent (Claude Code or Codex) and of installing an autonomous AI research skill created by a DeepSeek researcher, covering architecture, failure modes, and unattended operation.
A DeepSeek researcher open-sourced AutoResearch, an autonomous framework that can plan, execute, and debug RL experiments on the DeepSeek 285B model without human intervention, accompanied by a self-play survey paper.
GPT-5.4, in collaboration with Molecule.one's Maria AI platform, autonomously drove a medicinal chemistry project from literature review to validated experimental result, proposing an unexpected improvement to a widely used reaction in drug discovery.
Deli AutoResearch SKILL, an autonomous framework that automates GPU experiments and RL pipelines, is open-sourced along with a companion survey paper on self-play.
Sakana AI launches its first commercial product, Sakana Marlin, an autonomous research assistant that completes strategy work in hours by generating structured slides and detailed reports.
THU Team Eureka open-sources EurekAgent, an autonomous research system built with Claude Code that achieves state-of-the-art results on math, kernel engineering, and ML tasks through environment engineering.
A paper introducing Arbor, an AI framework that enables autonomous scientific research by combining strategic coordination, isolated hypothesis testing, and a persistent knowledge tree to iteratively improve research outcomes across multiple domains.
This paper proposes a method for autonomous research agents using hypothesis-tree refinement to generate and test hypotheses, aiming toward generalist scientific discovery.
Arbor is an AI framework for autonomous scientific research that uses a coordinator, executors, and a persistent hypothesis tree to iteratively improve research outcomes across multiple domains, achieving strong results on six real research tasks.
ResearchClawBench is a benchmark for evaluating end-to-end autonomous scientific research across 40 tasks from 10 domains, using expert-curated rubrics. Current systems score poorly, highlighting challenges in achieving reliable autonomous scientific discovery.
This paper presents CatDT, a self-evolving multi-agent digital twin that autonomously predicts heterogeneous catalyst properties from a bulk crystal and a reaction description, achieving experimental accuracy across seven benchmarks and discovering non-precious catalyst candidates for propane dehydrogenation.
AutoLab is a new benchmark evaluating 17 frontier models on 36 expert-curated long-horizon tasks (system optimization, model development, CUDA kernels, puzzles), finding that persistence—not initial attempt quality—is the dominant predictor of success. Claude-opus-4.6 led all categories, while most other models terminated prematurely or exhausted budgets with minimal progress.
AutoMedBench is a workflow-aware benchmark for autonomous medical-AI research, evaluating agents across five stages on diverse medical imaging tasks. Stage-level scoring reveals validation as the weakest stage, highlighting the need for reliable verification in agentic workflows.
ResearchClawBench is a benchmark for evaluating end-to-end autonomous scientific research across 40 tasks from 10 domains, revealing that current AI agents and LLMs achieve low re-discovery accuracy, with Claude Code averaging a score of 21.5 and Claude-Opus-4.7 averaging 20.7.
ScientistOne introduces Chain-of-Evidence, a verifiability framework for autonomous research agents that ensures every claim is traceable to evidence, achieving zero hallucinated references, perfect score verification, and the highest method-code alignment across 75 papers while matching or exceeding human expert performance on frontier research tasks.
Dexter is an open-source autonomous financial research agent that analyzes stocks and builds investment theses using real-time data, task planning, and self-reflection.
Karpathy open-sourced an experimental project, autoresearch, that lets an AI agent automatically complete the research loop for small-scale LLM training: modify code, run experiments, evaluate results, and iterate. Humans only need to write the research plan and constraints.
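
The modify, run, evaluate, iterate loop described in the last entry can be illustrated with a minimal sketch. This is not code from the autoresearch project: every name here (research_loop, toy_experiment, make_proposer, the "lr" setting) is hypothetical, the "experiment" is a toy loss function rather than real LLM training, and the proposal step is a random perturbation standing in for an agent's code change.

```python
import random
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]


def toy_experiment(config: Config) -> float:
    """Stand-in for a training run: returns a loss, lower is better (assumed)."""
    return (config["lr"] - 0.3) ** 2


def make_proposer(seed: int) -> Callable[[Config], Config]:
    """Build a proposer that randomly perturbs the current best config.

    In a real system this step would be the agent editing code; here it is
    only a seeded random change so the sketch stays self-contained.
    """
    rng = random.Random(seed)

    def propose(config: Config) -> Config:
        candidate = dict(config)
        candidate["lr"] += rng.uniform(-0.2, 0.2)
        return candidate

    return propose


def research_loop(
    baseline: Config,
    propose: Callable[[Config], Config],
    run_experiment: Callable[[Config], float],
    budget: int,
) -> Tuple[Config, float, List[Tuple[Config, float]]]:
    """Modify, run, evaluate, and keep a change only if the score improves."""
    best_config = dict(baseline)
    best_score = run_experiment(best_config)
    history = [(dict(best_config), best_score)]  # record every trial, including failures

    for _ in range(budget):  # the human-set constraint: a fixed experiment budget
        candidate = propose(best_config)      # modify
        score = run_experiment(candidate)     # run
        history.append((candidate, score))    # evaluate and log
        if score < best_score:                # iterate from the best result so far
            best_config, best_score = candidate, score

    return best_config, best_score, history
```

Calling research_loop({"lr": 1.0}, make_proposer(seed=0), toy_experiment, budget=20) returns the best configuration found, its score, and the full trial history. Keeping the rejected trials in the history is a deliberate choice in this sketch, since failed experiments are part of the record a human reviewer would want to inspect.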