@yibie: 3 new additions this round: 1. Auto-Quant: Apply Karpathy autoresearch to FreqTrade cryptocurrency strategy backtesting, evolving multi-strategy combinations across 5 trading pairs through a keep/discard loop. 2. Universal…
Summary
yibie shared three new entries from the awesome-autoresearch list, covering automated quantitative trading, universal skill optimization, and a Claude Code plugin.
Cached at: 06/03/26, 05:45 AM
3 new additions this round:
- Auto-Quant: Applies Karpathy-style autoresearch to FreqTrade cryptocurrency strategy backtesting, evolving a multi-strategy portfolio across 5 trading pairs through keep/discard loops.
- Universal Skill: Reframes autoresearch as a general-purpose prompt optimization skill for Claude Code/Cursor, automatically scanning codebases, defining evaluation criteria, and retaining high-scoring variants.
- Claude Code Plugin: Generalizes autoresearch into a dual-skill system (discovery + optimization), running autonomous experimentation loops on any quantifiable metric.
http://github.com/yibie/awesome-autoresearch… Currently 428 entries
yibie/awesome-autoresearch
Source: https://github.com/yibie/awesome-autoresearch
awesome-autoresearch
A curated awesome list of public autoresearch use cases across industries.
This README is the homepage aggregate of the current category files, so the latest accepted entries are visible here without drilling into subpages.
The repository treats all categories equally — each entry lives in exactly one category, chosen by its direct autoresearch application domain. A dedicated Related Practices / Discussions category captures credible public practice signals — X threads, Reddit discussions, and interviews — that describe real autoresearch usage even when no strong standalone case page exists yet.
Why this list
Most discussions about autoresearch are still scattered, vague, or overly tool-centric. This list is designed to answer two practical questions quickly:
- Where has autoresearch already been used in real workflows?
- Which patterns can transfer across industries?
This is not a comprehensive database. It is a high-signal, fast-scanning field guide.
Inclusion criteria
An entry should meet all of the following:
- The source is public and citable.
- The example is directly related to autoresearch, not just a generic research or monitoring agent.
- The source explicitly mentions autoresearch, cites Karpathy’s autoresearch, or clearly shows a modify → verify → keep/discard → repeat loop.
- The summary explains the scenario, method, and value in one sentence.
We do not include:
- Generic research agents, monitoring agents, or multi-agent systems with no explicit autoresearch loop.
- Pure theory or opinion without a concrete practice.
- Generic AI commentary with no autoresearch workflow.
- Long write-ups inside the list itself.
- Sources that are private, inaccessible, or too vague to classify.
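As a rough illustration of the modify → verify → keep/discard → repeat loop named in the inclusion criteria, the Python sketch below hill-climbs two numbers against a fixed scoring function. It is not taken from Karpathy’s autoresearch or from any listed project: evaluate, propose, autoresearch_loop, and the toy objective are hypothetical names chosen for this example, and a real setup would have an agent edit a file such as train.py and run a fixed-budget experiment instead of perturbing a dictionary.

```python
import random


def evaluate(params):
    """Verify: fixed, hypothetical evaluator. Higher is better, peak at x=3, y=-1."""
    return -((params["x"] - 3.0) ** 2 + (params["y"] + 1.0) ** 2)


def propose(params, rng, step=0.5):
    """Modify: copy the incumbent and perturb one randomly chosen parameter."""
    candidate = dict(params)
    key = rng.choice(sorted(candidate))
    candidate[key] += rng.uniform(-step, step)
    return candidate


def autoresearch_loop(initial, rounds=200, seed=0):
    """Repeat modify -> verify -> keep/discard for a fixed number of rounds."""
    rng = random.Random(seed)
    best, best_score = dict(initial), evaluate(initial)
    ledger = []  # one (round, score, decision) row per experiment
    for i in range(rounds):
        candidate = propose(best, rng)
        score = evaluate(candidate)
        kept = score > best_score  # keep only strict improvements
        if kept:
            best, best_score = candidate, score
        ledger.append((i, score, "keep" if kept else "discard"))
    return best, best_score, ledger


if __name__ == "__main__":
    best, score, ledger = autoresearch_loop({"x": 0.0, "y": 0.0})
    kept = sum(1 for _, _, decision in ledger if decision == "keep")
    print(f"best={best} score={score:.4f} kept={kept}/{len(ledger)}")
```

In this sketch the evaluator stays fixed and separate from what the loop is allowed to modify, so a candidate is kept only when it actually scores higher than the incumbent.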
Current coverage
- Scientific Research — 60 entries
- Software / Systems Optimization — 51 entries
- Evaluation / Red Teaming — 21 entries
- Finance / Trading — 31 entries
- Personal Knowledge / Humanities — 2 entries
- Knowledge Base / RAG Preparation — 2 entries
- Market Research — 0 entries
- Workflow Automation — 4 entries
- Infra / Skills / Forks — 122 entries
- Related Practices / Discussions — 135 entries
Open categories still being seeded
- Competitive Intelligence — 0 entries
- Content Research — 0 entries
- Customer Discovery — 0 entries
- Lead Generation — 0 entries
- Trend Monitoring — 0 entries
Each entry lives in exactly one category. When a project could fit multiple categories, we choose the one closest to its direct application domain.
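The per-category counts under Current coverage can be cross-checked against the 428 entries quoted alongside the repository link. A minimal Python sketch of that arithmetic follows; the counts are copied from the list above, and the dictionary name coverage is only an illustrative choice.

```python
# Entry counts per category, copied from the "Current coverage" list.
coverage = {
    "Scientific Research": 60,
    "Software / Systems Optimization": 51,
    "Evaluation / Red Teaming": 21,
    "Finance / Trading": 31,
    "Personal Knowledge / Humanities": 2,
    "Knowledge Base / RAG Preparation": 2,
    "Market Research": 0,
    "Workflow Automation": 4,
    "Infra / Skills / Forks": 122,
    "Related Practices / Discussions": 135,
}

# Because each entry lives in exactly one category, the sum is the list size.
total = sum(coverage.values())
print(total)  # prints 428
```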
Browse by category
- Scientific Research (source)
- Software / Systems Optimization (source)
- Evaluation / Red Teaming (source)
- Finance / Trading (source)
- Personal Knowledge / Humanities (source)
- Knowledge Base / RAG Preparation (source)
- Workflow Automation (source)
- Infra / Skills / Forks (source)
- Related Practices / Discussions (source)
Full list
Scientific Research
Source file: categories/scientific-research.md
- AutoResearchClaw (https://github.com/aiming-lab/AutoResearchClaw) - Scientific research: turns a research idea into a paper through a fully autonomous multi-stage pipeline with self-healing experiments and pivot/refine loops.
- Sibyl Research System (https://github.com/Sibyl-Research-Team/AutoResearch-SibylSystem) - Scientific research: builds a fully autonomous AI scientist on Claude Code with inner research-iteration loops and outer self-evolution across projects.
- autoresearch-rl (https://github.com/vivekvkashyap/autoresearch-rl) - RL research: applies the autoresearch pattern to RL post-training by iterating on one training config, running fixed-time experiments, and keeping only eval improvements.
- autoresearch-robotics (https://github.com/jellyheadandrew/autoresearch-robotics) - Robotics research: adapts Karpathy-style autoresearch to MuJoCo and Gymnasium robotics tasks by editing one training file, evaluating fixed-budget runs, and using simulator renderings plus vision feedback to keep only better policies.
- autoslam (https://github.com/pglira/autoslam) - LiDAR SLAM research: applies an autoresearch-style autonomous loop to LiDAR SLAM method design and evaluation on the KITTI odometry benchmark, accumulating a leaderboard over weeks through iterative keep/discard experiments.
- Tinker-Explorer (https://github.com/mohammed840/autoresearch-explorer) - Evidence-retrieval research: adapts the autoresearch pattern to GRPO document exploration, comparing reward designs and keeping only retrieval policies that answer multihop questions more accurately under a token budget.
- Autoresearch on an old research idea (https://ykumar.me/blog/eclip-autoresearch/) - Multimodal retrieval research: applies Claude Code autoresearch to an old eCLIP idea, running 42 fixed-budget experiments with commit/revert decisions and cutting mean rank from 344.68 to 157.43.
- autoresearch-at-home (https://github.com/mutable-state-inc/autoresearch-at-home) - Distributed ML research: coordinates a SETI@home-style swarm of agents that claim experiments, share full train.py results through Ensue, and collectively drive down val_bpb across different GPUs.
- autoresearch-paper-benchmark (https://github.com/roth-andreas/autoresearch-paper-benchmark) - Graph ML research: runs paper-driven campaigns on a fixed Peptides-func benchmark by editing train.py, logging 300-second experiments, and testing only the best validation-AP model at campaign end.
- autoresearch-cifar10 (https://github.com/GuillaumeErhard/autoresearch-cifar10) - Vision research: applies autoresearch to CIFAR-10 ResNet training on a 3090, iterating under fixed time budgets and keeping changes that lift accuracy beyond a 91.89% baseline.
- autoresearch-yolo (https://github.com/Tuesdayyy/autoresearch-yolo) - Vision research: adapts Karpathy’s one-file keep/revert loop to YOLOv8 on COCO128, editing only train.py, running 10-minute experiments, and raising mAP50-95 from 0.3820 to 0.8303 in the published results ledger.
- AutoResearch-GenPose (https://github.com/PeleWang/AutoResearch-GenPose) - Vision research: adapts autoresearch to CIFAR-10 UNet denoising by editing one training file, running fixed 5-minute experiments, and keeping only val_psnr improvements.
- MLP-AutoResearch (https://github.com/HuangShengZeBlueSky/MLP_AutoResearch) - MNIST training research: ports Karpathy’s single-file loop to an MLP classifier, fixing 20-epoch runs and greedy keep/revert decisions that raised handwritten-digit accuracy from 0.9809 to 0.9836.
- autoresearch-medimage (https://github.com/mattlungrenmd/autoresearch-medimage) - Medical imaging research: adapts Karpathy’s prepare.py + train.py + results.tsv loop to 2D imaging tasks, using short-budget candidate discovery and staged follow-up validation to surface stronger ChestXray14 models.
- Agentic MR sequence development (https://arxiv.org/abs/2604.13282) - MRI sequence research: uses Agent4MR with physics-aware validation to let agents generate, refine, and autoresearch PyPulseq sequences for a fluid-suppressed spin-echo EPI challenge across model generations.
- autocircuit (https://github.com/qelloman/autocircuit) - Analog circuit optimization: adapts Karpathy’s autoresearch to a SKY130 two-stage op-amp, editing optimize.py, running ngspice, and keeping only parameter changes that expand the GBW-versus-power Pareto front under phase-margin constraints.
- Photonic Device Auto-Design Agent (https://github.com/flexcompute/autophotonicdesign) - Photonic design research: applies a Karpathy-inspired loop to one design.py, running layout previews, fabrication DRC, FDTD simulations, and keep/discard decisions to optimize devices such as splitters, tapers, and crossings.
- fe-autoresearch (https://github.com/ezemriv/fe-autoresearch) - Tabular ML research: applies the autoresearch loop to LightGBM feature engineering on the UCI Bank Marketing dataset by editing one engineer_features() target, training against fixed AUC metrics, and keeping only improvements.
- AutoFeaTune (https://github.com/PranavBedi11/AutoFeaTune) - Tabular ML research: edits only features.py, trains XGBoost under 5-fold CV composite scoring, and uses a git ratchet plus graph memory to keep or revert feature transforms over repeated autoresearch runs.
- Paper Lantern improves Autoresearch (https://www.paperlantern.ai/blog/auto-research-case-study) - ML research augmentation: connects a 2M-paper MCP server to autoresearch, letting the agent cite 100 papers across 100 experiments and reach a 3.2% lower 2-hour validation loss than the same run without paper access.
- Subtractive Search in a Mature Tabular Pipeline (https://github.com/jhamandeep/autoresearch-tabular-case-study) - Tabular ML research: applies Karpathy’s autoresearch to a churn-prediction XGBoost pipeline, running 116 autonomous experiments and lifting subsample AUC from 0.902892 to 0.916721 largely by removing noisy target-encoded features.
- autoresearch-connect4 (https://github.com/alessoh/autoresearch-connect4) - Game AI research: adapts Karpathy’s three-file autoresearch loop to Connect Four by editing train.py, training 5-minute self-play runs, and keeping only changes that improve weighted win rate against fixed opponents.
- autoresearch-tabular (https://github.com/saikatkumardey/autoresearch-tabular) - Tabular ML research: adapts Karpathy’s three-file loop to the Adult Income benchmark by editing only train.py, running fixed 2-minute experiments, and keeping only val_auc improvements.
- ocr-autoresearch (https://github.com/zhd36/ocr-autoresearch) - OCR research: adapts autoresearch to ICDAR2015 scene-text recognition by editing one train.py, running fixed 5-minute CRNN+CTC experiments, and keeping only lower validation character error rates.
- xgboost-autoresearch (https://github.com/szilard/xgboost-autoresearch) - XGBoost model optimization: provides a reusable template and a concrete end-to-end example that autonomously iterates on feature engineering and XGBoost hyperparameters for tabular datasets, actively searching the web for domain knowledge and ML best practices to inform each experiment, and delivering sustained AUC improvements across successive iterations.
- Tennis XGBoost Autoresearch (https://github.com/buildoak/tennis-xgboost-autoresearch) - Sports analytics research: applies a Karpathy-style keep/revert loop to a 245K-match tennis XGBoost pipeline, then hardens the evaluator after the agent learned to game mutable ROC-AUC scoring.
- Bio-Autoresearch (https://github.com/monk1337/Bio-Autoresearch) - Drug discovery research: applies a Karpathy-style autoresearch loop to rare-disease drug repurposing on PrimeKG, running 15 GPU experiments with keep/revert decisions and lifting held-out per-disease AUPRC from 0.284 to 0.761.
- autoresearch-quantum (https://github.com/saymrwulf/autoresearch-quantum) - Quantum research: runs incumbent/challenger autoresearch ratchets for encoded magic-state experiments, screens candidates on cheap noisy simulations, and promotes only justified challengers to expensive backends while logging transferable lessons.
- kaggle-autoresearch (https://github.com/aplassard/kaggle-autoresearch) - Tabular ML research: adapts Karpathy-style autoresearch to Kaggle competitions such as Titanic, House Prices, and Store Sales by iterating on feature and model code, logging approved baselines, and accepting only cross-validation improvements over fixed thresholds.
- MiniMax M2.7: Early Echoes of Self-Evolution (https://www.minimax.io/news/minimax-m27-en) - AI-lab research: describes an internal research agent that automates 30%-50% of RL workflow and a 100+ round keep/revert scaffold-optimization loop that improved internal evaluation scores by 30%.
- autoresearch-macro (https://github.com/wegard/autoresearch-macro) - Macroeconomic forecasting research: runs LLM-guided outer-loop search over Chronos-2 covariates, transforms, and fine-tuning settings, keeping only validation-era forecast improvements across pseudo-real-time Norway, Canada, and Sweden benchmarks.
- autoresearch-dqn (https://github.com/raphaelschwinger/autoresearch-dqn) - RL algorithm research: applies the autoresearch loop to a CartPole training script, logging 39 iterations that replaced an unstable DQN baseline with a REINFORCE agent that reaches reward 500 in about 5 seconds instead of about 3 minutes.
- AutoMedal (https://github.com/Flameingmoy/automedal) - Kaggle competition research: adapts Karpathy’s keep/revert loop into strategist, researcher, and experimenter phases, journaling 24 tabular-competition experiments and keeping only lower val_loss changes on a fixed leaderboard-oriented harness.
- autoresearch-qwen (https://github.com/wadeKeith/autoresearch-qwen) - Document VQA research: adapts Karpathy’s keep/discard loop to Qwen3-VL on the official DocVQA benchmark by fixing evaluate.py, limiting edits to train.py, and accepting only higher full-validation ANLS scores.
- Wine Variety Prediction with LLMs (https://github.com/ivanfioravanti/wine_variety_classification) - ML fine-tuning research: uses a Karpathy-inspired agent loop to edit one LoRA YAML config, parse HPSEARCH_RESULT accuracy from short MLX runs, log each trial in JSONL, and revert worse hyperparameter changes on the wine-classification benchmark.
- Autoresearch for Research (https://github.com/michK/Autoresearch-for-Research) - Physics research: uses an LLM to hypothesize, implement, run, and interpret PDE experiments across five domains (Navier-Stokes, Kuramoto-Sivashinsky, Gray-Scott, NLS, CGLE), achieving 7-24× score improvements with no problem-specific training.
- ScaleAutoResearch-Ramsey (https://github.com/ypwang61/ScaleAutoResearch-Ramsey) - Mathematics research: scales autoresearch to Ramsey number computation, discovering new lower bounds R(3,17)≥93 and R(4,15)≥160 that improve 32-year-old records which even DeepMind’s AlphaEvolve could not improve.
- bountyhunter: Vesuvius Autoresearch (https://github.com/jonmarrs/vesuvius-autoresearch) - Scientific research: deploys the first autonomous research swarm for the Vesuvius Challenge, running a full CV research loop with hypothesis generation, hyperparameter optimization
Similar Articles
@yibie: This week awesome-autoresearch added 3 items: 1. autoslam: Applies Karpathy's autoresearch loop to LiDAR SLAM method design, accumulating an experimental leaderboard on the KITTI benchmark 2. Bir…
This week awesome-autoresearch added three items, including the autoslam project that applies Karpathy's autoresearch loop to LiDAR SLAM, and two blog posts analyzing the original experiments and revealing metric gaming.
@yibie: awesome-autoresearch updated, adding 7 entries. pi-lifeline — a very practical pi extension: let local small models run autoresearch, automatically ask stronger models for help when stuck. Tobi Lütke also mentioned this on X...
awesome-autoresearch list updated, adding 7 entries, including pi-lifeline (local small model asks strong model for help), Marketing Mix Modeling 12x improvement, TokenTelemetry open-source tracking tool, etc., covering autoresearch application cases across multiple industries.
@yibie: awesome-autoresearch updated, added 6 entries. Trace2Evolve — applying autoresearch to the self-evolution of customer service agents. Automatically generate difficult cases, score traces, classify failure reasons, only retain improvements when both benchmark and reliability gate pass...
awesome-autoresearch list updated, adding 6 application cases based on Karpathy's autoresearch pattern, covering scenarios such as customer service agent self-evolution, Shell integration, code configuration self-optimization, RAG tuning, and ASO.
@yibie: This week's autoresearch ecosystem evidence scan: 9 new records, total count 383. AutoResearch-RL: A continuous RL research framework with prepare.py/train.py isolation, supporting LLM/hybrid strategy experiment scheduling l…
This week, 9 new records were added to the autoresearch ecosystem, bringing the total to 383, covering multiple open-source tools and projects such as the AutoResearch-RL reinforcement learning framework, lance-autoresearch database kernel optimization, and Clio prediction market backtesting framework.
@yibie: awesome-autoresearch updated, added 3 entries. dreamworld — world model research. Applied the autoresearch loop to pixel-level world model training (CarRacing-v3), where the agent can perform keep/discard experiments autonomously in tokenizer, dy…
awesome-autoresearch updated, added dreamworld (world model research), Odyssey Engine (general iterative engine), and an article by Kirill Krainov on self-improvement of agentic coding.