@yibie: 3 new additions this round: 1. Auto-Quant: Apply Karpathy autoresearch to FreqTrade cryptocurrency strategy backtesting, evolving multi-strategy combinations across 5 trading pairs through a keep/discard loop. 2. Universal…


Summary

yibie shared three new entries from the awesome-autoresearch list, covering automated quantitative trading, universal skill optimization, and a Claude Code plugin.


Cached at: 06/03/26, 05:45 AM

3 new additions this round:

  1. Auto-Quant: Applies Karpathy-style autoresearch to FreqTrade cryptocurrency strategy backtesting, evolving a multi-strategy portfolio across 5 trading pairs through keep/discard loops.
  2. Universal Skill: Reframes autoresearch as a general-purpose prompt optimization skill for Claude Code/Cursor, automatically scanning codebases, defining evaluation criteria, and retaining high-scoring variants.
  3. Claude Code Plugin: Generalizes autoresearch into a dual-skill system (discovery + optimization), running autonomous experimentation loops on any quantifiable metric.

http://github.com/yibie/awesome-autoresearch… Currently 428 entries


yibie/awesome-autoresearch

Source: https://github.com/yibie/awesome-autoresearch

awesome-autoresearch

A curated awesome list of public autoresearch use cases across industries.

This README is the homepage aggregate of the current category files, so the latest accepted entries are visible here without drilling into subpages.

The repository treats all categories equally — each entry lives in exactly one category, chosen by its direct autoresearch application domain. A dedicated Related Practices / Discussions category captures credible public practice signals — X threads, Reddit discussions, and interviews — that describe real autoresearch usage even when no strong standalone case page exists yet.

Why this list

Most discussions about autoresearch are still scattered, vague, or overly tool-centric. This list is designed to answer two practical questions quickly:

  • Where has autoresearch already been used in real workflows?
  • Which patterns can transfer across industries?

This is not a comprehensive database. It is a high-signal, fast-scanning field guide.

Inclusion criteria

An entry should meet all of the following:

  • The source is public and citable.
  • The example is directly related to autoresearch, not just a generic research or monitoring agent.
  • The source explicitly mentions autoresearch, cites Karpathy’s autoresearch, or clearly shows a modify → verify → keep/discard → repeat loop.
  • The summary explains the scenario, method, and value in one sentence.
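The modify → verify → keep/discard → repeat loop named in the criteria above can be sketched as a greedy search. This is a minimal illustration under stated assumptions, not code from any listed project: `keep_discard_loop`, `toy_modify`, and `toy_evaluate` are hypothetical names, the candidate is a single number rather than a training file, and a higher score is assumed to be better.

```python
import random
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def keep_discard_loop(
    initial: T,
    modify: Callable[[T, random.Random], T],
    evaluate: Callable[[T], float],
    rounds: int,
    seed: int = 0,
) -> Tuple[T, float, List[str]]:
    """Greedy modify -> verify -> keep/discard -> repeat (hypothetical helper)."""
    rng = random.Random(seed)
    best, best_score = initial, evaluate(initial)
    log: List[str] = []
    for _ in range(rounds):
        candidate = modify(best, rng)   # modify: propose a change to the incumbent
        score = evaluate(candidate)     # verify: score it with a fixed evaluator
        if score > best_score:          # keep: only strict improvements survive
            best, best_score = candidate, score
            log.append("keep")
        else:                           # discard: revert to the incumbent
            log.append("discard")
    return best, best_score, log


def toy_modify(x: float, rng: random.Random) -> float:
    # Assumption: a small random perturbation stands in for an agent's edit.
    return x + rng.uniform(-1.0, 1.0)


def toy_evaluate(x: float) -> float:
    # Assumption: a toy metric that peaks at x = 3.
    return -((x - 3.0) ** 2)


if __name__ == "__main__":
    best, score, log = keep_discard_loop(0.0, toy_modify, toy_evaluate, rounds=200)
    print(f"best={best:.3f} score={score:.5f} kept={log.count('keep')}")
```

Because a candidate is kept only when it beats the incumbent, the best score never decreases. The evaluator is passed in and never modified by the loop, which mirrors the fixed-metric setups many entries in this list describe.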

We do not include:

  • Generic research agents, monitoring agents, or multi-agent systems with no explicit autoresearch loop.
  • Pure theory or opinion without a concrete practice.
  • Generic AI commentary with no autoresearch workflow.
  • Long write-ups inside the list itself.
  • Sources that are private, inaccessible, or too vague to classify.

Current coverage

Open categories still being seeded

Each entry lives in exactly one category. When a project could fit multiple categories, we choose the one closest to its direct application domain.
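The one-category-per-entry rule could be checked mechanically. The sketch below assumes entries look like the flattened lines on this page, `Name (URL) - Domain: summary`; the repository's actual category files may use different markup, and `parse_entry`, `multi_category_urls`, and the sample category names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


class Entry(NamedTuple):
    name: str
    url: str
    domain: str
    summary: str


# Assumed entry shape: optional bullet, "Name (URL) - Domain: summary".
ENTRY_RE = re.compile(
    r"^\s*(?:[•\-*]\s+)?(?P<name>.+?) \((?P<url>https?://\S+?)\) - "
    r"(?P<domain>[^:]+): (?P<summary>.+)$"
)


def parse_entry(line: str) -> Optional[Entry]:
    """Parse one list line; return None if it does not match the assumed shape."""
    match = ENTRY_RE.match(line)
    return Entry(**match.groupdict()) if match else None


def multi_category_urls(categories: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each URL that appears in more than one category to those categories."""
    seen = defaultdict(set)
    for category, lines in categories.items():
        for line in lines:
            entry = parse_entry(line)
            if entry is not None:
                # Normalize lightly so trailing slashes and case do not hide duplicates.
                seen[entry.url.rstrip("/").lower()].add(category)
    return {url: sorted(cats) for url, cats in seen.items() if len(cats) > 1}


if __name__ == "__main__":
    line = (
        "  • autoslam (https://github.com/pglira/autoslam) - LiDAR SLAM research: "
        "applies an autoresearch-style autonomous loop to LiDAR SLAM method design."
    )
    # "other-category" is a made-up name used only to trigger the check.
    sample = {"scientific-research": [line], "other-category": [line]}
    print(parse_entry(line))
    print(multi_category_urls(sample))
```

An empty result from `multi_category_urls` means no URL was found in more than one category.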

Browse by category

Full list

Scientific Research

Source file: categories/scientific-research.md

  • AutoResearchClaw (https://github.com/aiming-lab/AutoResearchClaw) - Scientific research: turns a research idea into a paper through a fully autonomous multi-stage pipeline with self-healing experiments and pivot/refine loops.
  • Sibyl Research System (https://github.com/Sibyl-Research-Team/AutoResearch-SibylSystem) - Scientific research: builds a fully autonomous AI scientist on Claude Code with inner research-iteration loops and outer self-evolution across projects.
  • autoresearch-rl (https://github.com/vivekvkashyap/autoresearch-rl) - RL research: applies the autoresearch pattern to RL post-training by iterating on one training config, running fixed-time experiments, and keeping only eval improvements.
  • autoresearch-robotics (https://github.com/jellyheadandrew/autoresearch-robotics) - Robotics research: adapts Karpathy-style autoresearch to MuJoCo and Gymnasium robotics tasks by editing one training file, evaluating fixed-budget runs, and using simulator renderings plus vision feedback to keep only better policies.
  • autoslam (https://github.com/pglira/autoslam) - LiDAR SLAM research: applies an autoresearch-style autonomous loop to LiDAR SLAM method design and evaluation on the KITTI odometry benchmark, accumulating a leaderboard over weeks through iterative keep/discard experiments.
  • Tinker-Explorer (https://github.com/mohammed840/autoresearch-explorer) - Evidence-retrieval research: adapts the autoresearch pattern to GRPO document exploration, comparing reward designs and keeping only retrieval policies that answer multihop questions more accurately under a token budget.
  • Autoresearch on an old research idea (https://ykumar.me/blog/eclip-autoresearch/) - Multimodal retrieval research: applies Claude Code autoresearch to an old eCLIP idea, running 42 fixed-budget experiments with commit/revert decisions and cutting mean rank from 344.68 to 157.43.
  • autoresearch-at-home (https://github.com/mutable-state-inc/autoresearch-at-home) - Distributed ML research: coordinates a SETI@home-style swarm of agents that claim experiments, share full train.py results through Ensue, and collectively drive down val_bpb across different GPUs.
  • autoresearch-paper-benchmark (https://github.com/roth-andreas/autoresearch-paper-benchmark) - Graph ML research: runs paper-driven campaigns on a fixed Peptides-func benchmark by editing train.py, logging 300-second experiments, and testing only the best validation-AP model at campaign end.
  • autoresearch-cifar10 (https://github.com/GuillaumeErhard/autoresearch-cifar10) - Vision research: applies autoresearch to CIFAR-10 ResNet training on a 3090, iterating under fixed time budgets and keeping changes that lift accuracy beyond a 91.89% baseline.
  • autoresearch-yolo (https://github.com/Tuesdayyy/autoresearch-yolo) - Vision research: adapts Karpathy’s one-file keep/revert loop to YOLOv8 on COCO128, editing only train.py, running 10-minute experiments, and raising mAP50-95 from 0.3820 to 0.8303 in the published results ledger.
  • AutoResearch-GenPose (https://github.com/PeleWang/AutoResearch-GenPose) - Vision research: adapts autoresearch to CIFAR-10 UNet denoising by editing one training file, running fixed 5-minute experiments, and keeping only val_psnr improvements.
  • MLP-AutoResearch (https://github.com/HuangShengZeBlueSky/MLP_AutoResearch) - MNIST training research: ports Karpathy’s single-file loop to an MLP classifier, fixing 20-epoch runs and greedy keep/revert decisions that raised handwritten-digit accuracy from 0.9809 to 0.9836.
  • autoresearch-medimage (https://github.com/mattlungrenmd/autoresearch-medimage) - Medical imaging research: adapts Karpathy’s prepare.py + train.py + results.tsv loop to 2D imaging tasks, using short-budget candidate discovery and staged follow-up validation to surface stronger ChestXray14 models.
  • Agentic MR sequence development (https://arxiv.org/abs/2604.13282) - MRI sequence research: uses Agent4MR with physics-aware validation to let agents generate, refine, and autoresearch PyPulseq sequences for a fluid-suppressed spin-echo EPI challenge across model generations.
  • autocircuit (https://github.com/qelloman/autocircuit) - Analog circuit optimization: adapts Karpathy’s autoresearch to a SKY130 two-stage op-amp, editing optimize.py, running ngspice, and keeping only parameter changes that expand the GBW-versus-power Pareto front under phase-margin constraints.
  • Photonic Device Auto-Design Agent (https://github.com/flexcompute/autophotonicdesign) - Photonic design research: applies a Karpathy-inspired loop to one design.py, running layout previews, fabrication DRC, FDTD simulations, and keep/discard decisions to optimize devices such as splitters, tapers, and crossings.
  • fe-autoresearch (https://github.com/ezemriv/fe-autoresearch) - Tabular ML research: applies the autoresearch loop to LightGBM feature engineering on the UCI Bank Marketing dataset by editing one engineer_features() target, training against fixed AUC metrics, and keeping only improvements.
  • AutoFeaTune (https://github.com/PranavBedi11/AutoFeaTune) - Tabular ML research: edits only features.py, trains XGBoost under 5-fold CV composite scoring, and uses a git ratchet plus graph memory to keep or revert feature transforms over repeated autoresearch runs.
  • Paper Lantern improves Autoresearch (https://www.paperlantern.ai/blog/auto-research-case-study) - ML research augmentation: connects a 2M-paper MCP server to autoresearch, letting the agent cite 100 papers across 100 experiments and reach a 3.2% lower 2-hour validation loss than the same run without paper access.
  • Subtractive Search in a Mature Tabular Pipeline (https://github.com/jhamandeep/autoresearch-tabular-case-study) - Tabular ML research: applies Karpathy’s autoresearch to a churn-prediction XGBoost pipeline, running 116 autonomous experiments and lifting subsample AUC from 0.902892 to 0.916721 largely by removing noisy target-encoded features.
  • autoresearch-connect4 (https://github.com/alessoh/autoresearch-connect4) - Game AI research: adapts Karpathy’s three-file autoresearch loop to Connect Four by editing train.py, training 5-minute self-play runs, and keeping only changes that improve weighted win rate against fixed opponents.
  • autoresearch-tabular (https://github.com/saikatkumardey/autoresearch-tabular) - Tabular ML research: adapts Karpathy’s three-file loop to the Adult Income benchmark by editing only train.py, running fixed 2-minute experiments, and keeping only val_auc improvements.
  • ocr-autoresearch (https://github.com/zhd36/ocr-autoresearch) - OCR research: adapts autoresearch to ICDAR2015 scene-text recognition by editing one train.py, running fixed 5-minute CRNN+CTC experiments, and keeping only lower validation character error rates.
  • xgboost-autoresearch (https://github.com/szilard/xgboost-autoresearch) - XGBoost model optimization: provides a reusable template and an end-to-end example that iterates autonomously on feature engineering and XGBoost hyperparameters for tabular datasets, searching the web for domain knowledge and ML best practices before each experiment and reporting AUC improvements across successive iterations.
  • Tennis XGBoost Autoresearch (https://github.com/buildoak/tennis-xgboost-autoresearch) - Sports analytics research: applies a Karpathy-style keep/revert loop to a 245K-match tennis XGBoost pipeline, then hardens the evaluator after the agent learned to game mutable ROC-AUC scoring.
  • Bio-Autoresearch (https://github.com/monk1337/Bio-Autoresearch) - Drug discovery research: applies a Karpathy-style autoresearch loop to rare-disease drug repurposing on PrimeKG, running 15 GPU experiments with keep/revert decisions and lifting held-out per-disease AUPRC from 0.284 to 0.761.
  • autoresearch-quantum (https://github.com/saymrwulf/autoresearch-quantum) - Quantum research: runs incumbent/challenger autoresearch ratchets for encoded magic-state experiments, screens candidates on cheap noisy simulations, and promotes only justified challengers to expensive backends while logging transferable lessons.
  • kaggle-autoresearch (https://github.com/aplassard/kaggle-autoresearch) - Tabular ML research: adapts Karpathy-style autoresearch to Kaggle competitions such as Titanic, House Prices, and Store Sales by iterating on feature and model code, logging approved baselines, and accepting only cross-validation improvements over fixed thresholds.
  • MiniMax M2.7: Early Echoes of Self-Evolution (https://www.minimax.io/news/minimax-m27-en) - AI-lab research: describes an internal research agent that automates 30-50% of the RL workflow and a 100+ round keep/revert scaffold-optimization loop that improved internal evaluation scores by 30%.
  • autoresearch-macro (https://github.com/wegard/autoresearch-macro) - Macroeconomic forecasting research: runs LLM-guided outer-loop search over Chronos-2 covariates, transforms, and fine-tuning settings, keeping only validation-era forecast improvements across pseudo-real-time Norway, Canada, and Sweden benchmarks.
  • autoresearch-dqn (https://github.com/raphaelschwinger/autoresearch-dqn) - RL algorithm research: applies the autoresearch loop to a CartPole training script, logging 39 iterations that replaced an unstable DQN baseline with a REINFORCE agent that reaches reward 500 in about 5 seconds instead of about 3 minutes.
  • AutoMedal (https://github.com/Flameingmoy/automedal) - Kaggle competition research: adapts Karpathy’s keep/revert loop into strategist, researcher, and experimenter phases, journaling 24 tabular-competition experiments and keeping only lower val_loss changes on a fixed leaderboard-oriented harness.
  • autoresearch-qwen (https://github.com/wadeKeith/autoresearch-qwen) - Document VQA research: adapts Karpathy’s keep/discard loop to Qwen3-VL on the official DocVQA benchmark by fixing evaluate.py, limiting edits to train.py, and accepting only higher full-validation ANLS scores.
  • Wine Variety Prediction with LLMs (https://github.com/ivanfioravanti/wine_variety_classification) - ML fine-tuning research: uses a Karpathy-inspired agent loop to edit one LoRA YAML config, parse HPSEARCH_RESULT accuracy from short MLX runs, log each trial in JSONL, and revert worse hyperparameter changes on the wine-classification benchmark.
  • Autoresearch for Research (https://github.com/michK/Autoresearch-for-Research) - Physics research: uses an LLM to hypothesize, implement, run, and interpret PDE experiments across five domains (Navier-Stokes, Kuramoto-Sivashinsky, Gray-Scott, NLS, CGLE), achieving 7-24× score improvements with no problem-specific training.
  • ScaleAutoResearch-Ramsey (https://github.com/ypwang61/ScaleAutoResearch-Ramsey) - Mathematics research: scales autoresearch to Ramsey number computation, discovering new lower bounds R(3,17)≥93 and R(4,15)≥160 and improving 32-year-old records that even DeepMind’s AlphaEvolve had not improved.
  • bountyhunter: Vesuvius Autoresearch (https://github.com/jonmarrs/vesuvius-autoresearch) - Scientific research: deploys the first autonomous research swarm for the Vesuvius Challenge, running a full CV research loop with hypothesis generation, hyperparameter optimization
