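@yibie: awesome-autoresearch update: 6 new entries. Trace2Evolve: applies autoresearch to the self-evolution of customer-support agents. It automatically generates hard cases, scores traces, classifies failure causes, and only when benchma…
Summary
The awesome-autoresearch list has been updated with 6 new use cases built on Karpathy’s autoresearch pattern, covering customer-support agent self-evolution, shell integration, self-optimizing code configuration, RAG tuning, and ASO.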
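Cached at: 2026/05/20 16:35
awesome-autoresearch update: 6 new entries.
Trace2Evolve - applies autoresearch to the self-evolution of customer-support agents. It automatically generates hard cases, scores traces, classifies failure causes, and keeps an improvement only when both the benchmark and the reliability gates pass. Validated across retail and airline tasks.
bob-autoresearch - a port of Karpathy’s autoresearch to IBM Bob Shell. The demo uses a Snake AI: 10-15 rounds of keep/discard iteration take it from random wandering to a perfect-score Hamiltonian path.
claude-code-nightly - lets your Claude Code configuration evolve on its own every night. It automatically mines an eval suite from session history, then uses a Karpathy-style loop to improve CLAUDE.md, hooks, skills, and memory, with git rollback as a safeguard.
AutoRAG Research Lab - runs RAG pipeline tuning in Karpathy mode, with Planner → Worker → Evaluator multi-agent collaboration, Docker sandbox isolation, and semantic no-op detection to prevent fake iterations.
Google Play Description Autoresearcher - applies autoresearch to ASO. Each round runs ~25 keyword-density experiments and keeps only the Google Play description versions that raise the match score.
222 implementations + 114 discussions = 336 entries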
yibie/awesome-autoresearch
Source: https://github.com/yibie/awesome-autoresearch
awesome-autoresearch
A curated awesome list of public autoresearch use cases across industries.
This README is the homepage aggregate of the current category files, so the latest accepted entries are visible here without drilling into subpages.
The repository distinguishes between:
- primary categories for stronger case evidence such as repos, project pages, and concrete write-ups
- secondary overlap categories for cross-cutting patterns that reuse the same evidence from another angle
- Related Practices / Discussions for credible public practice signals — especially X threads, Reddit discussions, and interviews — that describe real autoresearch usage even when no strong standalone case page exists yet.
Why this list
Most discussions about autoresearch are still scattered, vague, or overly tool-centric. This list is designed to answer two practical questions quickly:
- Where has autoresearch already been used in real workflows?
- Which patterns can transfer across industries?
This is not a comprehensive database. It is a high-signal, fast-scanning field guide.
Inclusion criteria
An entry should meet all of the following:
- The source is public and citable.
- The example is directly related to autoresearch, not just a generic research or monitoring agent.
- The source explicitly mentions `autoresearch`, cites Karpathy’s autoresearch, or clearly shows a modify → verify → keep/discard → repeat loop.
- The summary explains the scenario, method, and value in one sentence.
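As a rough illustration of the modify → verify → keep/discard → repeat loop named in the criteria, here is a minimal Python sketch. The toy objective and the `propose`, `evaluate`, and `autoresearch_loop` names are assumptions made for illustration only; listed projects typically have an agent edit a file, run a fixed-budget experiment, and commit or revert with git.

```python
import random


def evaluate(params):
    # Hypothetical fixed evaluator: lower is better (a stand-in for a
    # validation metric). It never changes between rounds.
    return sum((p - 3.0) ** 2 for p in params)


def propose(params, rng):
    # Hypothetical "modify" step: perturb exactly one parameter.
    candidate = list(params)
    i = rng.randrange(len(candidate))
    candidate[i] += rng.uniform(-1.0, 1.0)
    return candidate


def autoresearch_loop(initial, rounds=200, seed=0):
    rng = random.Random(seed)
    best, best_score = list(initial), evaluate(initial)
    history = []
    for _ in range(rounds):
        candidate = propose(best, rng)      # modify
        score = evaluate(candidate)         # verify
        if score < best_score:              # keep only strict improvements
            best, best_score = candidate, score
            history.append("keep")
        else:                               # discard: the incumbent stays
            history.append("discard")
    return best, best_score, history


best, score, history = autoresearch_loop([0.0, 0.0])
```

Because a candidate replaces the incumbent only on a strict improvement, the best score can never get worse from one round to the next.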
We do not include:
- Generic research agents, monitoring agents, or multi-agent systems with no explicit autoresearch loop.
- Pure theory or opinion without a concrete practice.
- Generic AI commentary with no autoresearch workflow.
- Long write-ups inside the list itself.
- Sources that are private, inaccessible, or too vague to classify.
Current coverage
Primary categories
- Scientific Research — 41 entries
- Software / Systems Optimization — 44 entries
- Evaluation / Red Teaming — 16 entries
- Finance / Trading — 25 entries
- Personal Knowledge / Humanities — 2 entries
- Infra / Skills / Forks — 94 entries
- Related Practices / Discussions — 114 entries
Secondary overlap categories
- Knowledge Base / RAG Preparation — 4 entries
- Market Research — 1 entry
- Workflow Automation — 9 entries
Open categories still being tracked
- Competitive Intelligence — 0 entries
- Content Research — 0 entries
- Customer Discovery — 0 entries
- Lead Generation — 0 entries
- Trend Monitoring — 0 entries
Some entries intentionally appear in more than one overlap category when the same project is both a domain case and a reusable workflow pattern.
Browse by category
- Scientific Research (source)
- Software / Systems Optimization (source)
- Evaluation / Red Teaming (source)
- Finance / Trading (source)
- Personal Knowledge / Humanities (source)
- Infra / Skills / Forks (source)
- Related Practices / Discussions (source)
- Knowledge Base / RAG Preparation (source)
- Market Research (source)
- Workflow Automation (source)
Full list
Scientific Research
Source file: categories/scientific-research.md
- AutoResearchClaw - Scientific research: turns a research idea into a paper through a fully autonomous multi-stage pipeline with self-healing experiments and pivot/refine loops.
- Sibyl Research System - Scientific research: builds a fully autonomous AI scientist on Claude Code with inner research-iteration loops and outer self-evolution across projects.
- autoresearch-rl - RL research: applies the autoresearch pattern to RL post-training by iterating on one training config, running fixed-time experiments, and keeping only eval improvements.
- autoresearch-robotics - Robotics research: adapts Karpathy-style autoresearch to MuJoCo and Gymnasium robotics tasks by editing one training file, evaluating fixed-budget runs, and using simulator renderings plus vision feedback to keep only better policies.
- Tinker-Explorer - Evidence-retrieval research: adapts the autoresearch pattern to GRPO document exploration, comparing reward designs and keeping only retrieval policies that answer multihop questions more accurately under a token budget.
- Autoresearch on an old research idea - Multimodal retrieval research: applies Claude Code autoresearch to an old eCLIP idea, running 42 fixed-budget experiments with commit/revert decisions and cutting mean rank from 344.68 to 157.43.
- autoresearch-at-home - Distributed ML research: coordinates a SETI@home-style swarm of agents that claim experiments, share full `train.py` results through Ensue, and collectively drive down val_bpb across different GPUs.
- autoresearch-paper-benchmark - Graph ML research: runs paper-driven campaigns on a fixed Peptides-func benchmark by editing `train.py`, logging 300-second experiments, and testing only the best validation-AP model at campaign end.
- autoresearch-cifar10 - Vision research: applies autoresearch to CIFAR-10 ResNet training on a 3090, iterating under fixed time budgets and keeping changes that lift accuracy beyond a 91.89% baseline.
- autoresearch-yolo - Vision research: adapts Karpathy’s one-file keep/revert loop to YOLOv8 on COCO128, editing only `train.py`, running 10-minute experiments, and raising `mAP50-95` from 0.3820 to 0.8303 in the published results ledger.
- AutoResearch-GenPose - Vision research: adapts autoresearch to CIFAR-10 UNet denoising by editing one training file, running fixed 5-minute experiments, and keeping only val_psnr improvements.
- MLP-AutoResearch - MNIST training research: ports Karpathy’s single-file loop to an MLP classifier, fixing 20-epoch runs and greedy keep/revert decisions that raised handwritten-digit accuracy from 0.9809 to 0.9836.
- autoresearch-medimage - Medical imaging research: adapts Karpathy’s `prepare.py` + `train.py` + `results.tsv` loop to 2D imaging tasks, using short-budget candidate discovery and staged follow-up validation to surface stronger ChestXray14 models.
- Agentic MR sequence development - MRI sequence research: uses Agent4MR with physics-aware validation to let agents generate, refine, and autoresearch PyPulseq sequences for a fluid-suppressed spin-echo EPI challenge across model generations.
- autocircuit - Analog circuit optimization: adapts Karpathy’s autoresearch to a SKY130 two-stage op-amp, editing `optimize.py`, running ngspice, and keeping only parameter changes that expand the GBW-versus-power Pareto front under phase-margin constraints.
- Photonic Device Auto-Design Agent - Photonic design research: applies a Karpathy-inspired loop to one `design.py`, running layout previews, fabrication DRC, FDTD simulations, and keep/discard decisions to optimize devices such as splitters, tapers, and crossings.
- fe-autoresearch - Tabular ML research: applies the autoresearch loop to LightGBM feature engineering on the UCI Bank Marketing dataset by editing one `engineer_features()` target, training against fixed AUC metrics, and keeping only improvements.
- AutoFeaTune - Tabular ML research: edits only `features.py`, trains XGBoost under 5-fold CV composite scoring, and uses a git ratchet plus graph memory to keep or revert feature transforms over repeated autoresearch runs.
- Paper Lantern improves Autoresearch - ML research augmentation: connects a 2M-paper MCP server to autoresearch, letting the agent cite 100 papers across 100 experiments and reach a 3.2% lower 2-hour validation loss than the same run without paper access.
- Subtractive Search in a Mature Tabular Pipeline - Tabular ML research: applies Karpathy’s autoresearch to a churn-prediction XGBoost pipeline, running 116 autonomous experiments and lifting subsample AUC from 0.902892 to 0.916721 largely by removing noisy target-encoded features.
- autoresearch-connect4 - Game AI research: adapts Karpathy’s three-file autoresearch loop to Connect Four by editing `train.py`, training 5-minute self-play runs, and keeping only changes that improve weighted win rate against fixed opponents.
- autoresearch-tabular - Tabular ML research: adapts Karpathy’s three-file loop to the Adult Income benchmark by editing only `train.py`, running fixed 2-minute experiments, and keeping only `val_auc` improvements.
- ocr-autoresearch - OCR research: adapts autoresearch to ICDAR2015 scene-text recognition by editing one `train.py`, running fixed 5-minute CRNN+CTC experiments, and keeping only lower validation character error rates.
- xgboost-autoresearch - XGBoost model optimization: provides a reusable template and a concrete end-to-end example that autonomously iterates on feature engineering and XGBoost hyperparameters for tabular datasets, actively searching the web for domain knowledge and ML best practices to inform each experiment, and delivering sustained AUC improvements across successive iterations.
- Tennis XGBoost Autoresearch - Sports analytics research: applies a Karpathy-style keep/revert loop to a 245K-match tennis XGBoost pipeline, then hardens the evaluator after the agent learned to game mutable ROC-AUC scoring.
- Bio-Autoresearch - Drug discovery research: applies a Karpathy-style autoresearch loop to rare-disease drug repurposing on PrimeKG, running 15 GPU experiments with keep/revert decisions and lifting held-out per-disease AUPRC from 0.284 to 0.761.
- autoresearch-quantum - Quantum research: runs incumbent/challenger autoresearch ratchets for encoded magic-state experiments, screens candidates on cheap noisy simulations, and promotes only justified challengers to expensive backends while logging transferable lessons.
- kaggle-autoresearch - Tabular ML research: adapts Karpathy-style autoresearch to Kaggle competitions such as Titanic, House Prices, and Store Sales by iterating on feature and model code, logging approved baselines, and accepting only cross-validation improvements over fixed thresholds.
- MiniMax M2.7: Early Echoes of Self-Evolution - AI-lab research: describes an internal research agent that automates 30%-50% of RL workflow and a 100+ round keep/revert scaffold-optimization loop that improved internal evaluation scores by 30%.
- autoresearch-macro - Macroeconomic forecasting research: runs LLM-guided outer-loop search over Chronos-2 covariates, transforms, and fine-tuning settings, keeping only validation-era forecast improvements across pseudo-real-time Norway, Canada, and Sweden benchmarks.
- autoresearch-dqn - RL algorithm research: applies the autoresearch loop to a CartPole training script, logging 39 iterations that replaced an unstable DQN baseline with a REINFORCE agent that reaches reward 500 in about 5 seconds instead of about 3 minutes.
- AutoMedal - Kaggle competition research: adapts Karpathy’s keep/revert loop into strategist, researcher, and experimenter phases, journaling 24 tabular-competition experiments and keeping only lower `val_loss` changes on a fixed leaderboard-oriented harness.
- autoresearch-qwen - Document VQA research: adapts Karpathy’s keep/discard loop to Qwen3-VL on the official DocVQA benchmark by fixing `evaluate.py`, limiting edits to `train.py`, and accepting only higher full-validation ANLS scores.
- Wine Variety Prediction with LLMs - ML fine-tuning research: uses a Karpathy-inspired agent loop to edit one LoRA YAML config, parse `HPSEARCH_RESULT` accuracy from short MLX runs, log each trial in JSONL, and revert worse hyperparameter changes on the wine-classification benchmark.
- Autoresearch for Research - Physics research: uses an LLM to hypothesize, implement, run, and interpret PDE experiments across five domains (Navier-Stokes, Kuramoto-Sivashinsky, Gray-Scott, NLS, CGLE), achieving 7-24× score improvements with no problem-specific training.
- ScaleAutoResearch-Ramsey - Mathematics research: scales autoresearch to Ramsey number computation, discovering new lower bounds R(3,17)≥93 and R(4,15)≥160 that improved 32-year-old records after even DeepMind’s AlphaEvolve could not.
- bountyhunter: Vesuvius Autoresearch - Scientific research: deploys the first autonomous research swarm for the Vesuvius Challenge, running a full CV research loop with hypothesis generation, hyperparameter optimization, model training, and keep/discard decisions to uncover optimal configurations for reading ancient carbonized scrolls.
- AutoGo - Game AI research: applies Karpathy’s autoresearch to training a Go-playing AI from scratch, using cheap self-play data and fixed-time eval to keep only network architecture and training improvements across iterations.
- autoresearch-speedrun - NanoGPT speedrun research: applies Karpathy-style autoresearch with a two-stage paired-seed acceptance funnel and a curated literature-backed idea pool, cutting val_loss from 3.9249 to 3.8093 across 30 experiments, with 6 accepted changes and ~24 honest discards.
- autoresearch-gym - Robotics RL research: provides a Gymnasium workbench for agent-driven RL research loops under fixed benchmark contracts with live dashboard and deterministic evaluation, following the Karpathy autoresearch pattern with candidate training and metric inspection.
- automat - Materials science research: applies Karpathy’s autoresearch to design composition-only descriptors for materials regression tasks, requiring the agent to justify scientific reasoning behind each proposed descriptor before evaluating against held-out splits.
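Several entries above log every experiment to a results ledger and keep or revert each change. The sketch below shows one way such a ledger could look in Python. The column names, the `run_campaign` and `to_tsv` helpers, and the metric values are illustrative assumptions, not the format or results of any listed project.

```python
import csv
import io

FIELDS = ["experiment", "metric", "status"]


def run_campaign(metrics, baseline):
    """Greedy keep/revert over already-measured metrics (higher is better)."""
    best = baseline
    rows = []
    for i, metric in enumerate(metrics, start=1):
        status = "keep" if metric > best else "revert"
        if status == "keep":
            best = metric  # the kept change becomes the new incumbent
        rows.append({"experiment": i, "metric": f"{metric:.4f}", "status": status})
    return best, rows


def to_tsv(rows):
    # Serialize the ledger as tab-separated text with a header row.
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Made-up numbers, purely for illustration.
best, rows = run_campaign([0.48, 0.55, 0.53, 0.61], baseline=0.50)
ledger = to_tsv(rows)
```

Logging reverted experiments alongside kept ones preserves the full search history, which is what makes a published ledger auditable.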
Software / Systems Optimization
Source file: categories/software-systems-optimization.md
- karpathy/autoresearch - ML training optimization: the original autoresearch loop where an agent edits a GPT training script, runs fixed-time experiments, and keeps only improvements in validation bits-per-byte.
- AutoKernel - GPU optimization: applies Karpathy-style autoresearch to kernel bottlenecks, iterating on code, benchmarking, and keeping only changes that improve speed without breaking correctness.
- autoresearch-webgpu - Browser ML optimization: ports Karpathy’s autoresearch into the browser so agents can generate training code, run GPU-backed experiments, and feed losses back into the next iteration.
- autoresearch-local-llm - Local ML optimization: replaces Claude Code with a local Qwen model to run the standard autoresearch keep/revert loop on a shared single GPU.
- Shopify Liquid performance work via autoresearch - Software optimization: Tobi Lütke applied an autoresearch loop to Shopify’s Liquid template engine, producing 93 automated commits that improved parse+render performance by 53% with 61% fewer allocations.
- Autoresearch for SAT Solvers - SAT solver optimization: runs parallel MaxSAT experiments, updates reusable solver code plus expert memory, and improves public benchmark configurations against 2024 competition baselines.
- autoresearch — Heuristic CP Edition - Heuristic solver optimization: adapts autoresearch to C++ competitive-programming solvers by editing only `solver.cpp`, scoring fixed benchmark instances, and keeping only lower average solution costs.
- Autoresearch for game development - HTML5 game development: runs agents to build better games based on player feedback and usage metrics, benchmarking with game Elo in 1v1 matchups.
- SiliconSwarm@Ensue - Apple Silicon inference optimization: uses a multi-agent autoresearch loop to test ANE graph changes across chips and reports up to 6.31× lower median DistilBERT latency than CoreML.
- Flash-MoE - Apple Silicon inference optimization: uses a Claude Code autoresearch loop to run 43 Metal optimization experiments on Qwen3.5-397B and reach 20.34 tok/s on an M5 Max by overlapping SSD reads with GPU compute.
- Research-Driven Agents: When an agent reads before it codes - LLM inference optimization: extends Karpathy’s autoresearch with a literature-review phase that reads papers and competing forks before parallel llama.cpp experiments, landing five kernel fusions and about 15% faster x86 flash-attention generation in about 3 hours.
- Rails controller tuning with Claude Code /loop autoresearch - Backend performance optimization: adapts Karpathy’s keep/discard loop to Rails controller latency by locking benchmark scripts and test data, running 10-minute cycles, and auto-reverting regressions.
- Pytest speedups via autoresearch feedback loops - Test performance optimization: applies autoresearch to a backend pytest suite with a fixed evaluation harness, seven autonomous experiments, and a 295s → 71s keep/discard improvement path.
- autoresearch-sudoku - Solver optimization: uses an enhanced autoresearch loop to rewrite a Rust sudoku solver over 312 experiments and beat Tdoku plus rust_sudoku on 4 of 6 standard benchmark datasets.
- autospec - Backend service generation: applies an autoresearch-inspired keep-or-revert loop to natural-language business rules, iteratively building a Spring Boot service until Gradle and JUnit evaluation pass without regression.
- How I used autoresearch to fix Gumroad’s flaky tests in a week - Test reliability: uses OpenClaw autoresearch to run 206 commits and 94 CI cycles that fixed 13 flaky tests while surfacing a real file-ID remapping bug.
- WinMoE - Windows inference optimization: uses an AI-driven autoresearch methodology with one-change measurements and keep-or-reject ledgers to lift Qwen3.5-397B throughput from 0.44 to 1.9 tok/s on consumer hardware.
- ZK Autoresearch — Plonky3 DFT Optimizer - ZK prover optimization: applies Karpathy’s autoresearch pattern to Plonky3’s DFT code, running Rust tests plus Criterion benchmarks and keeping only commits that reduce `coset_lde_batch` time on BabyBear field workloads.
- autoresearch-go-ane - Apple Silicon training optimization: ports Karpathy’s loop to a Go plus ANE LLM trainer, benchmarking fixed 5-minute TinyStories runs with `benchstat` and keeping only lower `val_loss` configurations.
- openroad-autoresearch-ibex - Chip design optimization: applies a fixed-harness autoresearch loop to OpenROAD RTL-to-GDSII experiments on the IBEX CPU, using scout-promote screening and objective-aware history to keep only timing, area, or power improvements.
- OpenCLI - Browser automation reliability: adds a Karpathy-style autoresearch harness to OpenCLI, cycling review → modify → commit → verify → decide against fixed V2EX, Zhihu, browser, and save-as-CLI eval suites to keep only reliability improvements.
- autoresearch-cublas-sam3 - GPU kernel optimization: applies an autoresearch loop to SAM3 GEMM tuning by mutating one config at a time, benchmarking on real GPUs, and keeping only changes that improved throughput by 2.14% over 120 experiments on an RTX 3090.
- autoresearch-mamba - Mamba training optimization: adapts Karpathy’s fixed-evaluator, 5-minute keep/discard loop to MLX Mamba-2, Mamba-3, and hybrid Mamba-Transformer models on Apple Silicon by mutating one training surface to lower `val_bpb`.
- liltrAIner - Local LLM fine-tuning optimization: applies a Karpathy-style autoresearch loop to MLX LoRA runs on Apple Silicon, letting an agent mutate training data or config, score eval prompts, and keep or revert each fine-tuning experiment.
- english-app - Education app optimization: applies an autoresearch-inspired proposer → implement → test → evaluate → keep/discard loop to an English learning app, using pytest, TypeScript checks, and smoke tests to keep only changes scoring at least 6.0 across 10 autonomous iterations.
- How we built the best browser agent with Auto-Research - Browser automation optimization: uses parallel Claude Code auto-research loops against Online-Mind2Web, running 20-cycle harness edits with train/validation splits and reaching 97% on the benchmark while rejecting task-specific overfits.
- Speed up code with pi-autoresearch - Software performance optimization: applies pi-autoresearch to jsonista’s JSON decoding benchmark, keeping only measured wins and lifting one selected benchmark’s throughput by 56% while surfacing overfitting risks in accepted diffs.
- 588x Faster SQLite Ingestion With an Autoresearch Loop - ETL performance optimization: applies pi-autoresearch to a Python financial-data ingestion pipeline, benchmarking 50,000-row SQLite writes and keeping fixes that cut processing time from about 397s to 0.675s.
- auto-data-pipeline-optimization - Data pipeline optimization: adapts Karpathy’s three-file keep/revert loop to data engineering, letting an agent mutate compression, partitioning, chunking, and query logic against a fixed efficiency score and achieving +11.3% improvement over 20 experiments.
- nnmetal + labrat - Apple Silicon inference optimization: uses an autonomous Zig and Metal autoresearch loop that snapshots engine files, makes one kernel change at a time, runs compile, test, and benchmark gates, and commits only throughput or latency wins above a fixed threshold.
- HashSmith, Part 3: I Automated My Way to a 27% Faster Hash Table - Data-structure performance optimization: uses a Claude Code auto-optimize loop to profile, benchmark, and keep only wins on a JVM SwissTable implementation, landing three accepted changes and 13%-32% gains across eight benchmark scenarios.
- claude-code-bench - AI coding workflow optimization: applies Karpathy-style autoresearch to Claude Code’s 7-dimensional configuration space, running benchmark tasks and keeping only profiles that improve quality-adjusted scores for research depth, correctness, and convention adherence.
- autooptimization - Systems optimization: applies a profile-first autoresearch protocol to codebases like ClickHouse, Chroma, DataFusion, and RocksDB, keeping only statistically benchmarked optimizations backed by stack-level profiling evidence.
- helix-inference-opt - LLM inference optimization: applies a fixed 1-minute autoresearch benchmark to Qwen2.5-0.5B decoding on WikiText-2, rewriting only `infer.py` and keeping throughput gains only when bits-per-byte quality stays within a 1% guard.
- autoresearch-inference-optimization - Inference serving optimization: lets an agent rewrite `serve.sh` plus `experiment.yaml`, benchmark OpenAI-compatible servers under throughput, latency, and memory constraints, and keep only higher-scoring serving configs in `experiments.jsonl`.
- PolyTrader - Trading-system performance optimization: applies autoresearch to PolyTrader’s signal-detection hot path, keeping only test-clean code changes that cut end-to-end tracker latency from 25.7 ms to 0.46 ms across a published 10-iteration benchmark run.
- autoresearch-lora-buzhou - Local LoRA fine-tuning optimization: adapts autoresearch to user-chosen LoRA training goals by establishing a confirmed baseline, changing one parameter at a time, rerunning >1% wins for confirmation, and promoting only verified `val_loss` improvements to the best checkpoint.
- ZeroAPI - Model-routing optimization: applies an autoresearch loop to OpenClaw routing policy constants by rerunning evals on routing logs, changing one setting at a time, and keeping only traffic-verified improvements.
- One Hour, 37% Faster: Applying Autoresearch to Our Search Ranking Inference Endpoint - Search ranking optimization: applies Karpathy’s edit → measure → keep/revert loop to idealo’s production LTR inference endpoint, reducing preprocessing latency 5.9× and end-to-end latency 37% in one hour for ~$7.
- smallnest/autoresearch - Software development automation: generalizes Karpathy’s autoresearch into a multi-agent, multi-source (GitHub, local, Baidu iCafe, Codeup) issue-to-PR pipeline with rotator review, scoring gates, and automatic merge for any Git project.
- arete autoresearch-compiler - Compiler optimization: applies a Karpathy-style keep/revert loop to a self-hosting Scheme compiler, benchmarking fixed workloads and keeping only changes that improve runtime performance.
- auto-arch-tournament - CPU architecture optimization: extends Karpathy’s autoresearch loop to SystemVerilog RTL, proposing, implementing, and measuring microarchitectural hypotheses through riscv-formal, Verilator cosim, and FPGA place-and-route to achieve +91.9% CoreMark on a RISC-V core.
- autoresearch-function - Function optimization: adapts Karpathy’s autoresearch to a CPU-only scaffold that benchmarks arbitrary functions against correctness, execution time, memory, and throughput metrics, keeping only edits that improve a composite score.
- vllm-autoresearch - Inference serving optimization: applies Karpathy’s autoresearch to find optimal vLLM serving configurations for Llama 3.3 70B on 2× RTX 5090, editing one config at a time and benchmarking against throughput, latency, and noise-floor metrics with keep/discard decisions.
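A recurring pattern in this category is a guarded acceptance rule: a speedup is kept only if a quality or correctness metric does not regress beyond a tolerance. The Python sketch below illustrates the idea with a 1% quality guard. The `Measurement` type, the `accept` function, and all numbers are hypothetical and not taken from any listed project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    throughput: float     # e.g. tokens per second; higher is better
    bits_per_byte: float  # quality metric; lower is better


def accept(baseline, candidate, quality_guard=0.01):
    """Keep a candidate only if it is faster AND quality stays within the guard."""
    faster = candidate.throughput > baseline.throughput
    # Allow quality to degrade by at most `quality_guard` relative to baseline.
    quality_limit = baseline.bits_per_byte * (1.0 + quality_guard)
    within_guard = candidate.bits_per_byte <= quality_limit
    return faster and within_guard


# Illustrative measurements only.
baseline = Measurement(throughput=100.0, bits_per_byte=1.000)
fast_but_worse = Measurement(throughput=140.0, bits_per_byte=1.050)
fast_and_safe = Measurement(throughput=120.0, bits_per_byte=1.005)
```

Here `accept(baseline, fast_and_safe)` is true, while `accept(baseline, fast_but_worse)` is false because the 5% quality loss exceeds the 1% guard despite the larger speedup.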
Evaluation / Red Teaming
Source file: categories/evaluation-red-teaming.md
- Claudini - AI safety research: uses an autoresearch-style loop to invent and benchmark new LLM attack algorithms, keeping only methods that outperform baselines.
- autovoiceevals - Voice AI evaluation: attacks voice agents with adversarial callers, proposes prompt changes one at a time, and keeps or reverts them based on eval results.
- autoresearch-prompt-optimization - Prompt evaluation: applies the autoresearch loop to a fixed extraction benchmark, iteratively editing one prompt and keeping only accuracy gains on the eval set.
- We Used Autoresearch on Our AI Skill, It Taught Us to Write Better Tests - AI skill evaluation: runs a prompt-migration skill against six fixed codebase test cases, scores each change on correctness, completeness, and efficiency, and keeps only improvements while cherry-picking around harness overfit.
- AutoPrompter - Prompt evaluation: combines promptfoo-style metrics with autoresearch-style closed-loop iteration, generating datasets, testing target models, and refining prompts through a persistent experiment ledger.
- AutonomousTester - UI testing evaluation: adapts autoresearch to Playwright test generation by editing only `tests/test_suite.py`, measuring `coverage_score`, and auto-fixing or discarding test changes until coverage improves.
- Autoresearch for Agents from Scratch - Support-agent prompt evaluation: applies Karpathy’s keep/revert loop to `system_prompt.md`, scoring frozen adversarial support cases by tool-call accuracy and lifting the prompt from 0.05 to 0.80 over 15 experiments.
- LLM Privacy + Cost Router — Classifier Experiment - Privacy classification evaluation: runs a Karpathy-style autoresearch experiment across regex and prompt variants for a hybrid LLM privacy classifier, validating the best configuration at 96.7% holdout accuracy with 4.6% false negatives.
- AutoMemory - Agent memory evaluation: lets an agent rewrite its own memory system against LongMemEval, using an immutable evaluator over random question samples and iterating on code plus strategy notes in response to scored failures.
- How to stop your autoresearch loop from cheating - Autoresearch evaluation hardening: reports 71 experiments across nanochat training and MoE compression, showing loops drift quickly unless experiments are isolated and evaluator gates block shortcut gains.
- Autoreason - Output evaluation: extends Karpathy-style autoresearch to subjective writing and coding tasks by running incumbent-versus-revision-versus-synthesis tournaments under blind multi-judge Borda scoring and stopping only when the unchanged version wins twice, outperforming standard self-refinement baselines on writing tasks and 150 CodeContests problems.
- AutoResearchBench - Evaluation benchmark: a dedicated benchmark for autonomous scientific literature discovery with complementary deep and wide research tasks, showing that even the most powerful LLMs achieve only ~9% accuracy on literature-focused open-ended search.
- Jailbreak Autoresearch - Jailbreak red-teaming: applies Karpathy’s autoresearch loop to prompt-harness experiments, testing different header and footer harnesses against a fixed body and scoring responses against a rubric to find more effective jailbreak strategies.
- JustAsk - System prompt extraction: uses autoresearch-style curiosity-driven code agents to iteratively probe and reveal system prompts from frontier LLMs, accepted at ICML 2026.
- DSBench AutoResearch - Benchmark evaluation: applies Karpathy-style hill-climbing to each of DSBench’s 74 Kaggle modeling and 38 Modeloff analysis tasks, running per-task autoresearch repos with a forensic audit that beats 82/112 DSBench baselines.
- Trace2Evolve - Customer-support agent evaluation: applies Karpathy-style autoresearch to improve a tool-using support agent by generating hard cases, scoring traces, classifying failures, and keeping candidate improvements only when benchmark metrics and reliability gates pass across retail and airline task splits.
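For subjective outputs, some entries above replace a single scalar metric with rankings from several judges combined by Borda scoring. The Python sketch below shows a plain Borda count. It does not reproduce any project's actual procedure: the function names, the tie-break rule, and the judge rankings are illustrative assumptions.

```python
from collections import defaultdict


def borda_scores(rankings):
    """Each ranking lists candidates best-first; n candidates earn n-1 ... 0 points."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += n - 1 - position
    return dict(scores)


def borda_winner(rankings):
    scores = borda_scores(rankings)
    # Ties are broken alphabetically; an arbitrary choice made for this sketch.
    return min(scores, key=lambda c: (-scores[c], c))


# Three hypothetical judges ranking three versions of the same output.
judges = [
    ["revision", "incumbent", "synthesis"],
    ["synthesis", "revision", "incumbent"],
    ["revision", "synthesis", "incumbent"],
]
scores = borda_scores(judges)
winner = borda_winner(judges)
```

With these rankings the totals are 5 for `revision`, 3 for `synthesis`, and 1 for `incumbent`, so `revision` would replace the incumbent in a keep/discard step.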
Finance / Trading
Source file: categories/finance-trading.md
- atlas-gic - Trading: applies Karpathy-style autoresearch to a swarm of market agents, rewriting the worst-performing prompts and keeping changes only when rolling Sharpe improves.
- autoresearch-trading - Options trading: applies an autoresearch-style keep/revert loop to SPY strategy parameters, logging each experiment against backtest metrics.
- autoresearch-trading - Trading research: combines Karpathy-style autoresearch with classical optimization so the agent iterates on strategy structure while an optimizer tunes parameters and walk-forward validation decides what survives.
- BTCautoresearch - Bitcoin forecasting: uses Karpathy-style autoresearch to mutate a single formula file, score walk-forward out-of-sample RMSE, and keep only forecasting rules that beat the baseline power law.
- autoresearch-skfolio - Portfolio optimization: edits a single portfolio-research script, runs fixed out-of-sample validation across multiple datasets and reversed-return variants, and keeps only Deflated Sharpe Ratio gains.
- AutoHypothesis - Portfolio research: runs a Karpathy-style autoresearch loop on `agent.py`, iterating on DEV data and keeping only stock-selection hypotheses that clear one-shot holdback and walk-forward gates before final holdout evaluation.
- autoresearch-glm - Credit scoring: adapts autoresearch to Taiwan credit-default prediction by editing feature-policy code and keeping only validation AUC gains in a fixed logistic-GLM benchmark.
- Feature Research at Machine Speed - Credit underwriting: adapts Karpathy’s autoresearch to `compute_features.py`, scoring each iteration by Information Value and ending the round after 25 runs once no new strong features emerged, yielding 58 strong features and a +0.023 AUC lift at a matched 20-feature budget.
- autoresearch-markets - Prediction-market trading research: adapts Karpathy’s single-file keep/revert loop to Kalshi data, editing `train.py` and optimizing `val_logloss` on held-out resolved markets.
- Simmer Autoresearch - Prediction-market trading: lets agents mutate skill configs, measure P&L or edge over live trading cycles or historical replays, and auto-commit only the configurations that improve results.
- Autonomous Trading Strategy Research - Crypto trading research: adapts Karpathy’s single-file autoresearch loop to Hyperliquid perpetual futures, backtesting each `strategy.py` change on fixed historical data and keeping only score improvements across 103 autonomous experiments.
- PolyEdge AutoResearch - Prediction-market arbitrage: applies a Karpathy-style keep/discard loop to Polymarket Up/Down paper trading, mutating one strategy parameter at a time and scoring each multi-window run on P&L, fill rate, and trading frequency.
- AutoResearch — Autonomous DEX Strategy Discovery - DEX trading research: applies Karpathy-style autoresearch to Base DEX strategies, backtesting one mutation at a time against real Uniswap V3 and Aerodrome data and lifting composite score from 0.421 to 8.176 over 230+ experiments.
- Winning the Paradigm Prediction Market Challenge with Claude Code - Prediction-market market making: uses parallel Claude Code agents as an autoresearch swarm to generate 1,039 strategy variants, run 2,000+ evaluations, and optimize mean edge to a first-place finish in Paradigm’s challenge.
- Autoresearch Trading Strategy Optimizer - Crypto trading research: applies Karpathy’s autoresearch to one editable `strategy.py`, hill-climbing on deterministic historical backtests and keeping only commits that improve `final_portfolio_value / max_drawdown`.
- Investing Autoresearch - Trading strategy research: uses an autonomous Claude loop to rewrite `strategy.py`, backtest on held-out market data, and keep only strategies that improve out-of-sample Sharpe under walk-forward, slippage, and fee validation.
- EMA Crossover Autoresearch - Equity trading research: adapts Karpathy’s three-file autoresearch loop to an SBIN EMA strategy, mutating only `strategy.py`, backtesting a fixed 10-year Indian equities dataset, and keeping only changes that improve a composite return, Sharpe, and drawdown score.
- NSE AutoResearch - Equity trading research: adapts Karpathy’s single-file keep/revert loop to Indian stock backtesting by mutating only `strategy.py` and keeping only lower composite-score strategies over a fixed 10-year NSE harness.
- delu-agent - Live crypto trading: runs a self-improving treasury agent on Base with 5 parallel autoresearch loops that evolve the scoring model through 9,000+ backtested experiments 24/7 and execute trades via Bankr without human intervention.
- investment-autoresearch - Trading strategy research: a Claude Code plugin that runs parallel agents testing single-hypothesis strategy variants in isolated git worktrees, scoring each against baseline buy-and-hold, and resetting from verified insights when incremental search plateaus.
- autoresearch-crypto - Crypto trading research: applies Karpathy’s single-file, backtest-driven keep/revert loop to crypto strategy development with fixed historical data and metric-based acceptance gates.
- feat(autoresearch): autonomous experiment engine in Juspay Neurolink - Financial infrastructure research: merges an 8,700-line autonomous experiment engine with Karpathy-style sub-agents and git-backed keep/revert decisions into the Juspay Neurolink product codebase.
- ml-vs-leadlag-jp-stock - Japanese equity trading: applies Karpathy’s autoresearch framework to Japanese stock prediction, comparing ML-driven Sharpe maximization against a classical lead-lag PCA strategy on reproducible benchmarks with published write-up.
- autoresearch-backtesting - Equity trading: uses an autonomous LLM agent to iteratively rewrite its PyTorch network for SPY directional prediction, backtesting each mutation and keeping only Sharpe improvements in a train → backtest → keep/discard loop.
- IPL Odds Multi-Agent Autoresearch - Sports betting research: deploys 7 Claude Code subagents to forecast IPL match outcomes, grades each forecast by Brier score, and autonomously mutates reasoning rules across a full season without human intervention.
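
Several entries above accept a change only when it improves a walk-forward, out-of-sample error. The sketch below illustrates that gate on synthetic data. It is a minimal example under stated assumptions, not the method of any listed project: the function names (`walk_forward_splits`, `walk_forward_rmse`, `fit_last_value`, `fit_linear_drift`), the window sizes, and the linear toy series are all hypothetical.

```python
import math


def walk_forward_splits(n, train_size, test_size):
    """Yield (train_indices, test_indices) pairs that roll forward in time."""
    start = 0
    while start + train_size + test_size <= n:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size


def walk_forward_rmse(series, fit, train_size, test_size):
    """Fit on each training window, forecast the next window, pool the squared errors."""
    sq_errors = []
    for train_idx, test_idx in walk_forward_splits(len(series), train_size, test_size):
        predict = fit([series[i] for i in train_idx])
        for step, i in enumerate(test_idx, start=1):
            sq_errors.append((predict(step) - series[i]) ** 2)
    return math.sqrt(sum(sq_errors) / len(sq_errors))


def fit_last_value(train):
    """Baseline rule (hypothetical): repeat the last observed value."""
    last = train[-1]
    return lambda step: last


def fit_linear_drift(train):
    """Candidate rule (hypothetical): extend the average slope of the training window."""
    drift = (train[-1] - train[0]) / (len(train) - 1)
    last = train[-1]
    return lambda step: last + drift * step


series = [2.0 * t for t in range(40)]  # synthetic linear trend, for illustration only
baseline_rmse = walk_forward_rmse(series, fit_last_value, train_size=10, test_size=5)
candidate_rmse = walk_forward_rmse(series, fit_linear_drift, train_size=10, test_size=5)

# Keep the candidate rule only if it beats the baseline out of sample.
keep_candidate = candidate_rmse < baseline_rmse
```

Each test window lies strictly after its training window, so a rule cannot be scored on data it was fitted on. A real harness would also need costs, slippage, and a final untouched holdout, as several entries above note.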
Personal Knowledge / Humanities
Source file: categories/personal-knowledge-humanities.md
- autoresearch-genealogy - Genealogy: uses Claude Code `/autoresearch` prompts to expand family trees, verify claims against multiple sources, and keep a structured evidence-backed research vault.
- claude-obsidian - Personal knowledge: uses a Karpathy-style `/autoresearch` skill to run multi-round web research with gap-filling and file source-backed concept, entity, and synthesis pages into a compounding Obsidian wiki vault.
Infra / Skills / Forks
Source file: categories/infra-skills-forks.md
- n-autoresearch - Autoresearch infra: extends Karpathy’s loop with structured experiment state, multi-GPU parallelism, adaptive search, and crash recovery.
- autoresearch-evaluation-harness - Evaluation infrastructure: compares autoresearch-style proposal strategies under fixed task adapters, explicit scalar scoring, and hard keep/discard gates.
- autoresearch-mlx - Apple Silicon fork: ports Karpathy’s autoresearch to MLX while keeping the fixed-time training budget, single mutable file, and git keep/revert loop.
- Claude Autoresearch - Claude Code skill: generalizes Karpathy’s autoresearch into a reusable modify → verify → keep/discard loop for measurable goals beyond ML.
- claude-autoresearch - Claude Code plugin: runs autoresearch on isolated branches with deterministic verification commands, scheduled overnight sessions, and structured keep/discard reports.
- lazy-developer - Claude Code plugin suite: runs repeated autoresearch phases across coverage, build speed, test speed, complexity, and performance goals while enforcing per-phase file locks and revert-on-regression behavior.
- codex-autoresearch - Codex skill: brings the autoresearch pattern to Codex for unattended metric-driven software iteration with automatic keep/discard decisions.
- gemini-autoresearch - Gemini CLI and Antigravity skill: runs goal-driven overnight improvement loops with verify and guard gates, keeping metric wins and automatically reverting regressions.
- autoresearch-plugin - Claude Code plugin: packages the Karpathy-style experiment loop into init/test/run commands for projects with explicit evaluation scripts and git rollback.
- Artificial General Research - Optimization framework: turns measurable code optimization tasks into autoresearch loops with variance-aware acceptance, artifact detection, and exhausted-approach tracking.
- autoresearch-engram - Memory extension: adds persistent recall, pattern extraction, and reflection steps to Karpathy’s autoresearch so the agent remembers what worked across long runs.
- pi-autoresearch - pi extension: generalizes Karpathy’s autoresearch into experiment tools, a live dashboard, and slash-command skills for metric-driven optimization beyond ML.
- openclaw-autoresearch - OpenClaw plugin: ports pi-autoresearch to OpenClaw with pending-run enforcement, confidence scoring, checkpoint files, and git-backed keep/discard semantics.
- AutoResearch OpenClaw - OpenClaw package: runs Karpathy-style keep/revert loops on one text asset by measuring a scalar metric, committing winners, reverting losers, and logging each session to `results.tsv`.
- autoresearch-opencode - OpenCode skill: ports pi-autoresearch into OpenCode as a pure skill that logs JSONL experiment runs and resumes autonomous keep/discard loops with built-in tools.
- pi-autoresearch-studio - pi control plane: adds TUI and web dashboards, plan and ideas editing, and selective PR creation on top of pi-autoresearch sessions.
- autoresearch-gen - Scaffold generator: interviews the user, generates a verified autoresearch experiment scaffold, auto-runs the baseline, and repairs broken generated code before handoff to the agent.
- autoresearch-autoresearch - Meta-autoresearch repo: maintains a portable canonical loop distilled from karpathy/autoresearch and adjacent systems so new evidence can update a reusable agent-verifier architecture across domains.
- Bilevel Autoresearch - Meta-autoresearch framework: adds outer loops that rewrite autoresearch search mechanisms themselves and reports multi-run gains on Karpathy’s training benchmark.
- SkyPilot parallel autoresearch - GPU infrastructure: gives Karpathy’s autoresearch access to 16 GPUs so the agent can run parallel experiment waves, validate winners on faster hardware, and reach about 910 runs in about 8 hours.
- autoresearch_deeplake_swarm - Cloud swarm infrastructure: extends Karpathy’s loop with Modal-powered parallel workers and a shared Deeplake experiment notebook so multiple agents can explore `train.py` concurrently and surface only the best surviving commits.
- Autoresearch on Red Hat OpenShift AI - Kubernetes ML infrastructure: runs Karpathy’s autoresearch as a 24-hour OpenShift AI workload, packaging nanochat into containers that logged 198 experiments and improved validation loss by 2.3% without human intervention.
- serverless-autoresearch - SageMaker infrastructure: parallelizes Karpathy’s autoresearch on Spot training jobs so the agent evaluates `train.py` candidates with HUGI-style burst compute instead of paying for idle GPUs.
- autoresearch-win-rtx - Windows GPU fork: ports Karpathy’s single-file, 5-minute, val_bpb keep/discard loop to native Windows on consumer RTX GPUs.
- autoresearch-amd - AMD GPU fork: ports Karpathy’s single-file, 5-minute val_bpb keep/discard loop to ROCm by replacing Flash Attention 3 with portable SDPA for RDNA 4 cards.
- autoloop - Agent runtime: generalizes Karpathy’s autoresearch into bounded repo-level loops with inferred eval commands, explicit guardrails, and keep/discard decisions across multiple coding agents.
- GOAL.md - Goal-spec framework: generalizes Karpathy’s autoresearch to repos without native scalar metrics by constructing a project-specific fitness function in `GOAL.md`, then running measure → act → verify → keep/revert loops against it.
- autoresearch-claude-code - Claude Code plugin: ports pi-autoresearch into a pure plugin skill with JSONL state, slash-command control, and autonomous keep/discard loops for arbitrary METRIC-based benchmarks.
- autoresearch-benchmark - Benchmarking infrastructure: compares four autoresearch-style tools on the same sorting-throughput task and records both performance gains and iteration behavior under a shared setup.
- CORAL - Multi-agent autoresearch infrastructure: runs Claude Code, Codex, or OpenCode workers in isolated worktrees, grades each attempt with `coral eval`, and keeps scored improvements while sharing notes and skills across agents.
- autoresearch for agents - Agent evaluation template: adapts autoresearch to `agent.py` plus fixed `run_eval.py` and `dataset.json`, using LangSmith evals and git keep/discard decisions to improve one agent implementation.
- autoresearch-automl - Benchmarking research: compares nine classical, LLM-based, and hybrid optimizers on Karpathy’s nanochat task under a shared 24-hour budget, showing code-editing autoresearch is competitive but fixed-space classical HPO still wins.
- autoresearch-anycloud - Cloud GPU infrastructure: wraps Karpathy’s autoresearch in a unified Mac and cloud runner with platform setup, budget watchdogs, result collection, and automatic teardown across AWS, GCP, Azure, and OCI.
- skill-autoresearch for Hermes Agent - Hermes skill: optimizes prompts, scripts, and validators through baseline → diagnose → patch → re-evaluate → keep/revert loops with dependency checks and conservative holdout rules.
- autoresearch-anything - Claude Code skill: scaffolds Karpathy-style autoresearch pipelines for measurable business metrics by generating connectors, persistence setup, and deploy → measure → keep/discard loops around API-observable outcomes.
- AutoSkill - Skill prompt optimization framework: applies Karpathy’s keep/revert loop to `SKILL.md`, mutating one prompt at a time against test cases and improving an auto-reminder skill from 45% to 90% reliability over 60+ autonomous iterations.
- EvoSkill - Skill-evolution framework: analyzes failed coding-agent trajectories, proposes skill or prompt changes, evaluates them against benchmarks, and keeps only better agent variants in a Karpathy-style self-improvement loop.
- Skill Forge v2 - Skill and code optimization framework: adapts Karpathy’s autoresearch to `SKILL.md` files and generic codebases, using dry-run validation, objective deltas, and keep/revert thresholds to steer autonomous or guided experiment loops.
- autoimprove-cc - Claude Code skill optimizer: applies a Karpathy-style autoresearch loop to `SKILL.md`, scoring binary assertions from `eval.json` and committing or resetting each change based on pass-rate improvement.
- ehmo/autoresearch-skill - Claude Code and Codex skill: generalizes autoresearch into clean-room red, green, and refactor teams that iteratively find issues, fix them under test, and simplify code on a feature branch while the coordinator keeps only verified progress.
- ResearcherSkill - Claude Code and Codex skill: generalizes autoresearch into git-backed `.lab/` sessions with branching experiment trees, convergence detection, and commit/revert control, improving Yggdrasil agent rules from 1.82 to 7.04 in a published loop.
- Litmus - Parallel ML research infrastructure: turns OpenClaw into a multi-agent autoresearch lab with branch-isolated workers, scheduled director and synthesizer roles, and keep/revert experiment commits plus shared discoveries and skills.
- Autoresearch CLI - Cross-agent experiment infrastructure: packages Karpathy’s one-file, one-metric keep/revert loop as a Rust CLI that scaffolds configs, validates eval commands, records JSONL results, and installs slash-command skills into multiple coding agents.
- codex-autoresearcher - Codex experiment infrastructure: runs optimization campaigns through separate worker and judge Codex processes, a static `evaluate.sh`, and schema-validated keep or restore verdicts with durable attempt forensics.
- Codex Autoresearch Harness - Codex experiment harness: wraps `codex exec` in a bash loop so Codex can run one Karpathy-style experiment at a time, keep or reset `train.py` changes, and compare model variants on the nanochat benchmark.
- ExAutoresearch - Elixir autoresearch framework: hot-loads one experiment module at a time, trains GPT variants under fixed budgets across distributed GPU nodes, and uses a referee plus dashboard to early-stop losers and persist the best surviving trials.
- slowresearch - Delayed-feedback experiment skill: adapts autoresearch to content, outreach, pricing, and other publish-and-wait workflows by logging human-reported metrics and proposing the next hypothesis across long feedback cycles.
- AutoAgent - Agent-engineering infrastructure: applies Karpathy’s autoresearch to a single-file Harbor agent harness, rewriting `agent.py`, benchmarking scored tasks, and keeping only prompt, tool, or orchestration changes that raise total score.
- VibeHQ - Multi-agent coordination infrastructure: applies autoresearch to team protocol design by benchmarking agent swarms, analyzing failure logs, rewriting hub code via `/optimize-protocol`, and iterating until coordination flags and token waste fall.
- helix - Agent-agnostic autoresearch infrastructure: generalizes Karpathy’s loop into reproducible `helix.yaml` + `program.md` repos with backend-swappable agents, append-only `experiments.tsv` ledgers, and independently verifiable example helices.
- Autolab Companion Tools - Autoresearch companion infrastructure: adds statistical keep/discard verdicts, experiment-history steering, and multi-agent branch competitions to Karpathy’s GPT-pretraining loop through the `autojudge`, `autosteer`, and `autoevolve` CLIs.
- autoresearch-cpu - CPU ML fork: ports Karpathy’s autoresearch to commodity CPUs by replacing Flash Attention with native SDPA, shrinking defaults for 30-minute local runs, and preserving the same one-file `val_bpb` keep/discard loop without CUDA.
- autoresearch-cn - Chinese-language fork: translates Karpathy’s autoresearch into Chinese while preserving the original MIT license, single-file training loop, fixed-time budget, and val_bpb keep/discard git workflow.
- autoresearcher2 - Bayesian research infra: extends Karpathy’s autoresearch with structured Bayesian experiment selection, active-inference-inspired exploration, learntropy-style appraisal, and persistent memory, running v4.8 in production on dual RTX PRO 6000 Blackwell GPUs across NanoGPT training and Atari Breakout RL domains.
- multiautoresearch - Multi-agent research lab: organizes Karpathy-style autoresearch into three sub-projects (pre-training, post-training, inference optimization) with planner, researcher, reviewer, and memory-keeper sub-agents across Claude Code, Codex, OpenCode, and pi harnesses.
- autoresearch-serverless - GCP serverless infra: deploys Karpathy’s autoresearch as a fully autonomous Cloud Run research lab with NVIDIA L4/RTX PRO 6000 GPUs, Gemini Flash reasoning, GCS checkpoints, and Cloud Workflow chaining for multi-day 24/7 studies.
- hugoferreira/autoresearch - Codebase research framework: generalizes Karpathy’s loop into falsifiable hypotheses, isolated experiment worktrees, instrument-backed observations, strict gate review, and reusable lessons for measurable engineering goals.
- scalar-loop - Experiment infrastructure: generalizes Karpathy’s autoresearch into a Python-core CLI that seals harness files, enforces repo scope, and keeps or reverts only edits that pass metric and guard commands.
- Autoresearch Lab - Black-box experiment infrastructure: wraps any measurable code pipeline in Docker sandboxing, Python eval backends, host-side git commit/revert control, and keep/discard research loops.
- evo - Parallel optimization framework: extends Karpathy-style autoresearch with discovered benchmarks, git-worktree tree search, parallel subagents, and commit-only-if-gate-passing score improvements.
- autoresearch-builder - Claude Code plugin: auto-detects project type, mutates one target file, runs the chosen build, test, or train command, and keeps or discards each experiment based on a parsed scalar metric.
- AdaL vs Claude Code: Autoresearch Benchmark - Benchmarking research: publishes head-to-head Karpathy-autoresearch runs on A10 and H100 hardware, comparing best `val_bpb`, experiment counts, and kept improvements between AdaL and Claude Code.
- 达尔文.skill - Skill optimization framework: adapts Karpathy’s autoresearch to one `SKILL.md` at a time, rescoring structure plus real test outputs and keeping only higher-scoring revisions.
- Community Computer - Collaborative experiment network: turns autoresearch runs into signed, peer-reproducible experiment records and auto-publishes keep/discard outcomes from Claude Code or pi over a Radicle peer-to-peer network.
- autoresearch-macos - Apple Silicon fork: ports Karpathy’s autoresearch to macOS with MPS/SDPA fallbacks, preserving the fixed-time training budget, single mutable file, and git keep/revert loop on Apple Silicon Macs.
- autoresearch-tenstorrent - Tenstorrent hardware port: ports Karpathy’s autoresearch to Tenstorrent Wormhole devices with a TT-XLA backend, preserving the 5-minute fixed budget and keep/revert loop while adapting to a non-CUDA AI accelerator.
- autoresearch-prompt-optimization - Prompt optimization skill: adapts Karpathy’s autoresearch to autonomous prompt tuning, lifting accuracy from 74.72% to 100% across 8 keep/discard experiments with a locked eval script.
- autoresearch-council-arena - Writing optimization framework: combines Karpathy’s autoresearch with llm-council so 4 LLMs compete, rank each other, and hill-climb toward the best argument through keep/discard iterations.
- Auto-Research-Recipes - Training recipe research: turns agent-written code, real experiments, and evaluator feedback into a closed-loop autoresearch process that develops non-trivial training recipes.
- autoresearch-everywhere - Cross-platform expansion: optimizes Karpathy’s autoresearch for broader hardware and environment support while preserving the fixed-budget keep/revert core loop.
- awesome-autoresearch - Curated knowledge base: a parallel community-curated list of AutoResearch use cases with optimization traces and open-source implementations at 915+ stars.
- Awesome-Auto-Research-Tools - Curated tool index: catalogs Auto-Research tools, skill frameworks, platform plugins, and benchmark harnesses across the broader ecosystem.
- Autoharness - Agent harness optimization: lets agents explore changes to prompts, configs, and runtime context, run evals, and keep only improvements, achieving benchmark lifts up to 40.7%.
- Thoth - Orchestration runtime: provides a dashboard-first control plane for autoresearch that turns drifting agent work into durable runs, locked work items, and reviewable verdicts across Claude Code and Codex.
- GEPAResearch - Plugin for agentic frameworks: optimizes code using the Genetic-Pareto LLM-driven search algorithm with isolated git worktrees, gated evaluation, and Pareto frontier candidate tracking for keep/discard decisions.
- Codex Autoresearch - Codex plugin: wraps optimization goals into measured experiment packets with benchmark contracts, durable session files, evidence trails, and live dashboards for keep/discard review.
- Cursor Autoresearch - Cursor and VS Code extension: ports pi-autoresearch’s MCP-driven optimization loop into a code editor with JSONL tracking, browser dashboard, and agent skills for metric-driven keep/discard workflows.
- interpretable-autoresearch - Governance infrastructure: makes autoresearch agent behavior transparent and auditable through behavioral specifications, winning 2nd place at the MIT Spring 2026 Hackathon.
- skills-autoresearch-flue - Evaluation infrastructure: provides a Flue agent harness that evaluates a seed skill against project fixtures, asks a researcher model to improve the skill, then reruns evals against the candidate.
- autoresearch-plugin - Research pipeline plugin: packages a full autonomous research pipeline as a Claude Code plugin with 65 skills across 9 families, guiding a direction from vague idea to submission-ready paper.
- research-program-skill - Research workflow skill: combines ml-intern’s discovery discipline with autoresearch’s baseline-first, metric-first loop into a portable skill for Claude Code, Codex CLI, and Cursor Agent.
- autoimprove - Codebase improvement infra: generalizes Karpathy’s autoresearch for Claude Code and Codex with git worktree isolation, configurable improvement strategies, and keep/discard experiment logs.
- harvester - Code maintenance infra: applies the Karpathy autoresearch pattern to watch repos, surface improvement opportunities as GitHub issues, execute approved changes overnight via Claude Code, and produce draft PRs for morning review.
- Autoresearch - Parallel experimentation framework: spawns supportive and adversarial Claude Code workers to collect evidence for or against research directions, with a judge scoring write-ups and curating the roadmap through iterative convergence detection.
- pi-autoresearch-harness - pi extension: forks pi-autoresearch with worktree-isolated experiment harness, adding init/run/log experiment tools and a slash-command dashboard for metric-driven keep/discard loops.
- Spark Researcher - Lightweight autoresearch framework: blends Karpathy’s compact loop with domain chips, collective intelligence network, and bounded self-editing for disciplined local experiment management.
- EvE: Evolutionary Ensemble of Agents - Multi-agent evolution framework: wraps coding agents into a decentralized evolutionary ensemble that co-evolves solver code and agent guidance through autoresearch-style competition, published as arXiv:2605.09018.
- deep-evolve - Claude Code plugin: generalizes autoresearch from ML training to any software project with automatic evaluation harness generation, journal-based crash recovery, and self-evolutionary outer-loop strategy improvement.
- AutoResearch-AI - Autoresearch infra: provides an npm CLI that installs a durable research harness into ML repos so agents can inspect code, propose experiments, run baselines, and continue loops without losing context across sessions.
- autoresearch - Autoresearch infra: provides a self-driving experiment sweep loop with daemon-detached `autoresearch.py`, live PR-updating progress charts, and GPU monitoring for ML training projects.
- pi-lifeline - Autoresearch infra: pi extension that lets smaller autoresearch models request strategic guidance from a stronger advisor model when stuck on plateaus, implementing rate-limited escalation with configurable consecutive-failure and plateau-run thresholds.
- autoresearch-skill-Andrej-Karpathy - Claude Code skill: turns natural-language optimization goals into autonomous goal-parsing, baseline-capture, and keep/discard loops with real-data validation, regression protection, and a reported 76% P99 latency reduction on a live signal-trading pipeline.
- bob-autoresearch - IBM Bob Shell adaptation: ports Karpathy’s autoresearch pattern to IBM Bob Shell as a Snake AI evolution demo that autonomously improves from random play to Hamilton-path play over 10-15 keep/discard iterations.
- claude-code-nightly - Claude Code substrate improvement: runs a Karpathy-style autoresearch loop that mines session history into personal evaluation suites and autonomously improves CLAUDE.md, hooks, skills, and memory overnight with git-backed keep/revert decisions.
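
Most entries in this category wrap the same core pattern: propose one change, measure a scalar metric, keep the change only if the metric improves, otherwise discard it. The following is a minimal in-memory sketch of that pattern, assuming a higher-is-better metric. The names `keep_discard_loop`, `toy_score`, and `toy_mutate` are hypothetical, and the toy objective stands in for a real evaluation; the listed projects typically mutate files and use git commits or resets rather than Python variables.

```python
import random


def keep_discard_loop(baseline, mutate, score, rounds, seed=0):
    """Hill-climb: keep a candidate only if it scores strictly above the current best."""
    rng = random.Random(seed)
    best, best_score = baseline, score(baseline)
    log = []  # one (round, score, verdict) row per experiment
    for i in range(rounds):
        candidate = mutate(best, rng)      # propose one change to the current best
        candidate_score = score(candidate)  # measure with the fixed metric
        kept = candidate_score > best_score
        if kept:
            best, best_score = candidate, candidate_score  # "commit"
        # otherwise the candidate is dropped, which plays the role of a revert
        log.append((i, candidate_score, "keep" if kept else "discard"))
    return best, best_score, log


def toy_score(x):
    """Hypothetical metric: peaks at x = 3."""
    return -(x - 3.0) ** 2


def toy_mutate(x, rng):
    """Hypothetical mutation: a small random step."""
    return x + rng.uniform(-1.0, 1.0)


best, best_score, log = keep_discard_loop(0.0, toy_mutate, toy_score, rounds=200)
```

Because a candidate replaces the baseline only on strict improvement, the kept scores in `log` form an increasing sequence, and the final result can never be worse than the starting point under the chosen metric.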
Related Practices / Discussions
Source file: categories/related-practices-discussions.md
Trading / markets
- mary on adapting autoresearch into AutoPredict for prediction markets - X: describes a framework that applies evaluation + mutation + selection loops to prediction-market trading agents across weather, finance, and politics.
- 0xAggelos on overnight autoresearch across multiple exchanges - X: claims to apply Karpathy’s autoresearch pattern to trading strategy research across exchanges and symbols with thousands of experiments and no manual tuning.
- alex on verifier-driven autoresearch loops for Amazon seller workflows - X: argues that verifier-driven autoresearch loops fit PPC, catalog, and ASIN-level decisions better than peer-consensus swarms because proposals are kept only when the evaluator approves them.
- Brandon Pizzacalla on agentic paid ads systems based on Karpathy autoresearch - X: claims to use Karpathy-style autoresearch as the planning and experiment pattern behind autonomous paid-ads workflows.
- nlethetech on a NEPSE quant terminal refined through 300+ autoresearch cycles - X: says his NEPSE terminal trading workstation backtests strategies with a quant model refined through 300+ autoresearch cycles while combining market data and execution in one interface.
- tensorqt on Paradigm’s autoresearch hackathon for market making - X: describes a Paradigm challenge that treated prediction-market market making as an autoresearch problem and says top teams scaled the search with parallel Claude Code or Codex workers plus internal orchestration.
- xmal on using autoresearch in Naive-Power Law Blend market forecasting - X: says a new Naive-Power Law Blend asset-forecast study applied Karpathy’s autoresearch framework, found the loop effective, and exposed overfitting in the initial setup.
- Karpathy autoresearch loop driving a HMM + GEM ensemble - Reddit: describes applying Karpathy’s autoresearch loop to a Hidden Markov Model and Graph Ensemble Mixture trading strategy with autonomous strategy iteration.
- I made Claude Code build trading strategies — and built an adversarial harness to stop it from cheating - Reddit: describes applying Karpathy’s autoresearch to crypto strategy design, then building an adversarial verifier when the agent learned to game backtest scoring.
Business / GTM workflows
- Brandon Pizzacalla on applying Karpathy’s autoresearch pattern to cold email - X: describes a single-agent loop that tests cold-email variants on live reply-rate metrics, commits winners as the new baseline, and stops at convergence.
- Karpathy’s AutoResearch: How an ML Tool Became a Marketing Optimizer - Blog: reports that Ole Lehmann adapted the autoresearch loop to landing-page copy optimization, lifting a Claude copywriting skill from 56% to 92% checklist pass rate overnight for ~$15, and also applied it to website speed (1100ms → 67ms) and cold outreach.
- GoatGaucho on applying Karpathy’s autoresearch to TrustLayer trust scoring - X: reports using a mutable scoring config plus verifier on 120 labeled agents to push sybil-risk F1 from 0% to 95.9% in two keep-or-revert experiments before shipping the new thresholds to production.
- I used Karpathy’s AutoResearch while building my app - got 80% better AI outputs. Here’s exactly how I applied it outside ML - Reddit: says a Farcast GTM tool used an autoresearch-style ICP and channel loop against real data, keeping only more specific outreach and community recommendations and reporting roughly 80% better outputs.
- Applied autoresearch to Marketing Mix Modeling — 12x improvement, beat Google Meridian - GitHub Discussion: reports applying Karpathy’s autoresearch loop to Marketing Mix Modeling (MMM) and achieving 12x lift over Google Meridian by iterating on model components with keep/discard decisions.
Workflow automation / consumer ops
- agrim singh on applying autoresearch to business-class fare search - X: describes an `autofare` loop that proposed 52 alternate routings and date shifts, validated visa and layover rules from YAML, and kept only cheaper legal itineraries, cutting one quoted business-class trip from $4,716 to $2,424 in 16 iterations.
Prompt / evaluation
- Aakash Gupta on eval design as the make-or-break constraint in autoresearch - X: argues that autoresearch only transfers cleanly when the evaluator is binary, locked, compact, and specific enough to resist gaming.
- 野口寛士 on nightly autoresearch-style improvement runs on a Mac mini - X: says his team lets a Mac mini run overnight improvement experiments, accepted 20 automatic wins in one night, and found success-metric definition matters more than experiment volume.
- Ali Amiri on matching Claude Code quality by optimizing prompts with AutoResearch - X: describes using AutoResearch to tune prompts and push a Qwen-based workflow from weak to strong performance on a large-repo task.
- Clement Hoang on Headway using autoresearch for mental-health-agent prompts - X: reports that Headway used Karpathy’s autoresearch to run about 80 generate → evaluate → analyze → mutate iterations on mental-health-agent prompts and reach 99% accuracy against a 95% target.
- Phil Chen on using GPT-5.5 for autoresearch-style Sales Ops harness optimization - X: reports using an autoresearch-style loop on an internal Sales Ops dataset and seeing strong hold-out gains, suggesting newer models can sustain longer benchmark-driven optimization runs.
- Ren on benchmark-driven skill optimization with autoresearch - X: Chinese discussion noting that autoresearch can be paired with skill creation so agents define a benchmark and then iteratively optimize the skill against it.
- Aakash Gupta on porting autoresearch to prompt engineering - X: describes mapping autoresearch into a four-role prompt-optimization loop with a locked eval script, binary rubric, results log, and overnight iteration.
- SonnyClawAI on adversarial second-pass evals in autoresearch - X: reports that a 4-hour write → eval → revise → measure loop only surfaced real failures after adding a second adversarial evaluation pass, making evaluator disagreement itself the useful signal.
- jcyhc_ai on citation hallucinations slipping through the Autoreason evaluator stack - X: shows that nine passes of critics and judges still missed fabricated references in the Autoreason paper, highlighting evaluator blind spots in autoresearch-style research-writing loops.
- Barna on benchmark-coupled correctness gates for zk-autoresearch - X: reports adding a benchmark-scale bitwise validator and nondeterminism detection because small upstream tests missed bugs that only appeared at the exact workload being optimized.
- vincent_dalmaso on treating skills like products with autoresearch - X: describes running an autoresearch-skill loop with a baseline, one change at a time, and score-based keep decisions to improve skill behavior instead of rewriting prompts blindly.
- How I Built a Skill That Makes All My Other Skills Better (Using Karpathy’s Autoresearch) - Blog: describes a meta-skill that builds rubrics, converts weak dimensions into binary evals, and runs keep/discard loops to improve other Claude skills autonomously.
- Shrinking a Production Prompt by 28% With Autonomous Optimization - Blog: uses uditgoenka/autoresearch to cut a reference-data matrix by 28% over 65 autonomous iterations while maintaining 98.1% output quality through modify → measure → keep/revert loops.
- AutoResearch for Claude Skills: the two-command loop that finds the gaps, rewrites the file, and keeps only the wins - Blog: describes a self-improving loop for Claude Skills that scores outputs against evolving rules, surfaces unwritten beliefs, adds rules when scores rise, and lifted one skill from 83% to 100% pass rate in two iterations.
- Compressing Prompts with an Autoresearch Loop - Blog: compresses leaked production system prompts from Cursor, v0, Devin, and Lovable across 55 hypothesis-test iterations, deriving a practical taxonomy of what can be cut without breaking behavior.
- 0xjialin on autoresearch finding no headroom in a security-review skill - X: reports adapting autoresearch into an OpenClaw and Claude Code skill to test 10 key dimensions of the SlowMist security-review skill and finding no meaningful optimization space left.
- snapolino on agents cheating when prepare.py is visible - X: reports that a nanochat-style autoresearch loop gamed a combined val_bpb + trainable tokens metric by narrowing the softmax target set once the agent could read prepare.py, arguing for hidden or locked evaluators.
- Auto-research applied to a skill/prompt improvement - Blog: describes using an auto-research loop to test and improve a unit-test discovery skill against repeatable evals, proposing and scoring prompt changes to determine whether the skill performs as intended.
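Several entries above argue for hidden or locked evaluators. A minimal sketch of the "locked" half, assuming the evaluation script is a single file whose SHA-256 digest is recorded before the loop starts. The function names are hypothetical and not taken from any listed project.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def evaluator_is_locked(path, expected_digest):
    """True only if the evaluator file still matches the digest recorded at setup."""
    return file_digest(path) == expected_digest
```

A loop could call evaluator_is_locked before scoring each candidate and discard the run when it returns False. This only detects edits to the evaluator. It does not stop the agent from reading the file, which is the failure snapolino reports, so hiding the evaluator would be a separate measure.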
Software / code workflows
- 0xViviennn on adapting autoresearch to GitHub engineering workflows - X: describes a Chinese engineering adaptation of autoresearch with verifier-gated code cleanup, bug fixing, test additions, and rollback across numbered cycles.
- Dominic Elm on the three-file autoresearch loop - X: summarizes autoresearch as propose → train → check → keep or revert → repeat and frames the instruction file as the new bottleneck.
- Franci Penov on running overnight queues of autoresearch experiments - X: describes operating multiple machines that run queued autoresearch experiments, logs, summaries, and findings overnight.
- xfu on AutoResearch bug-finding persistence - X: short Chinese observation that autoresearch is unusually strong at repeatedly hunting hidden bugs until it finds them.
- Kyle Boddy on using autoresearch-claude-code to tune inference hardware parameters - X: describes using an autoresearch Claude Code setup to tune inference parameters on a 3090-based test machine.
- Shann Holmberg on setting up autoresearch with Claude Code - X: tutorial thread that frames autoresearch as one file, one metric, and autonomous keep/revert looping inside Claude Code.
- Bob on autoresearch wasting 11 days on a solved benchmark - X: highlights that a loop can stay mechanically healthy yet keep burning compute after metric saturation if it lacks usefulness-aware stop criteria.
- Doğaç on GPU-kernel autoresearch needing human nudges and repeated validation - X: argues that human steering and rerunning experiments help autoresearch escape local minima and debunk noisy wins in kernel optimization.
- Barna on model exhaustion signals in zk-autoresearch - X: reports that once Sonnet kept circling previously eliminated NTT ideas in the iteration memory, he treated that repetition as an exhaustion signal and switched models.
- Isaac Kargar on using autoresearch to improve another AI agent - X: reports giving Claude Code one agent’s codebase, running 24 keep/discard experiments, improving memory quality by 41%, and later refactoring after the loop showed the original bottleneck hypothesis was wrong.
- Google hit: “Karpathy’s autoresearch applied to debugging” - Reddit: discussion thread about adapting the autoresearch pattern to debugging and validation-driven repair loops.
- Show HN: Karpathy’s autoresearch applied to debugging - two open-source skills - Hacker News: describes applying the one-file, one-metric, keep/discard autoresearch pattern to root-cause analysis and bug-fixing skills for silent failures.
- Google hit: “Autoresearch with Claude on a real codebase (not ML)” - Reddit: discussion thread about applying the autoresearch pattern to a production codebase rather than an ML training script.
- Google hit: “I used Karpathy’s autoresearch pattern on product workflows” - Reddit: discussion thread suggesting the autoresearch loop is being adapted into broader product and self-improving agent workflows.
- Google hit: “Autoresearch with Claude on a real codebase (not ML training)” - Reddit: discussion thread emphasizing high failure rates and many discarded experiments as normal when autoresearch is applied to real production codebases.
- Google hit: “I adapted Karpathy’s autoresearch to build an auto-improvement loop for agentic coding skills” - Reddit: discussion thread about applying the autoresearch loop to iterative improvement of coding-agent skills.
- AutoResearch for Codex - Reddit: presents a Codex SDK optimizer that fans out candidate branches, benchmarks them under correctness tests, clones survivors, and reports about 33% inference gains plus 16× algorithm speedups.
- vdaubry on generalizing autoresearch beyond codebases - X: argues that the core autoresearch pattern can generalize from code optimization to load testing, landing page A/B tests, and infrastructure tuning when the benchmark is clear.
- Alex C. on turning autoresearch into a bounded debugging loop skill - X: describes deriving a bounded-experiment-loop skill from autoresearch and using it to improve agent debugging and code fixing.
- kavindpadi on using pi-autoresearch for SQL optimization - X: describes trying pi-autoresearch on intentionally inefficient BigQuery SQL and suggests the same metric-driven loop can target top-cost warehouse queries under platform-specific pricing constraints.
- Kr1sso on turning Instruments into a teammate for autoresearch - X: describes turning macOS Instruments into an LLM-ready profiling CLI so autoresearch loops can test CPU, Metal GPU, and memory hypotheses and keep or discard optimizations from trace data.
- Autoresearching Ruby Performance with LLMs - Conference talk: says LLM agents, reproducible benchmarks, skills, and MCPs can be combined into an autoresearch workbench for Ruby and Rails performance tuning while surfacing the loop’s practical limits.
- Brian on using autoresearch to build and deploy a Raspberry Pi ML model - X: reports six rounds of architecture search plus custom C inference, Zig cross-compilation, and SSH deployment to a Raspberry Pi 4, cutting p50 latency from 0.488 ms to 0.028 ms in about two hours.
- Darrell Thomas on an RTX 5090 CUDA kernel factory inspired by autoresearch - X: reports an AI loop that runs Nsight Compute, tunes kernels, and keeps or discards 39 CUDA variants, with DSYRK reaching 2.19× cuBLAS and quantum simulation running 2-5× faster than cuQuantum.
- abhijitmjj on a 13-hour autoresearch loop for a LaTeX scanner - X: reports using Karpathy’s keep/revert loop on a Markdown-to-LaTeX scanner, growing a 59-fixture corpus across 29 iterations to lift F1 from 0.896 to 1.0 while cutting real-world false positives by 78%.
- latentsea on autoresearch improving a SaaS-building harness against a time-to-Realworld benchmark - Hacker News: says autoresearch improved a Claude-driven SaaS harness that builds RealWorld implementations under 90-minute budgets and scores them on test pass counts, harness quality, and completion time.
- Building an Internal Research Agent for AutoResearch in Claude Code - Blog: describes shipping autoresearch as a standalone Claude Code skill and later adding an orchestrator layer after a month of production use revealed it was only half the system.
- Feature: Autoresearch Skill — Autonomous Git-Based Experiment Loop - GitHub issue: proposes adding a Karpathy-style autoresearch loop to the Hermes agent framework with git-backed keep/revert decisions and crash recovery.
- Autoresearch isn’t just for training models - Blog: Shopify engineer Dave recounts building pi-autoresearch with Tobi Lütke, open-sourcing it, and the internal #autoresearch-wins channel that has since logged unit tests 300× faster, builds 65% faster, and pnpm itself becoming faster through autonomous keep/discard loops.
- Autoresearch Beyond ML: Applying Karpathy’s Optimization Pattern to RAG Retrieval - Blog: adapts the autoresearch keep/discard loop to a production RAG knowledge base, mutating metadata and content one fix at a time and lifting retrieval pass rate from 22% to 89% while surfacing silent ontology-drift failures.
- The Dark Factory Harness: From Autonomous Hill-Climbing to Autonomous Research - Blog: adds research context, research memory, and research strategy layers around the autoresearch loop to prevent random-walk behavior and help agents learn from experiment trajectories after 20+ iterations.
- Two-phase Autoresearch Chain - Reddit: describes a structured two-phase approach to autoresearch with explicit planning and execution phases for more principled experiment design.
- My first post scored 1. Karpathy’s autoresearch idea helped me repost - Hacker News: reports using Karpathy’s autoresearch to optimize an HN post itself, treating upvote ratio as the metric and iterating on content through a keep/discard loop.
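The propose → check → keep or revert → repeat loop that many entries above describe can be sketched as a plain hill-climb. This is an illustration under stated assumptions, not any listed project's implementation: score and propose are hypothetical callables supplied by the caller, and the patience cutoff is one simple stand-in for the stop criteria that Bob's entry says a loop needs once the metric saturates.

```python
import random


def keep_discard_loop(score, propose, baseline, iterations, patience, seed=0):
    """Hill-climb: keep a candidate only if it beats the best score so far.

    Stops early after `patience` consecutive discards (a crude plateau check).
    """
    rng = random.Random(seed)
    best, best_score = baseline, score(baseline)
    log = []  # (step, candidate_score, kept) for every experiment
    stale = 0
    for step in range(iterations):
        candidate = propose(best, rng)
        candidate_score = score(candidate)
        kept = candidate_score > best_score
        log.append((step, candidate_score, kept))
        if kept:
            best, best_score = candidate, candidate_score
            stale = 0
        else:
            # Discard: `best` is left untouched, which plays the role of a revert.
            stale += 1
            if stale >= patience:
                break
    return best, best_score, log
```

For example, keep_discard_loop(lambda x: -(x - 3.0) ** 2, lambda x, rng: x + rng.uniform(-1.0, 1.0), 0.0, 200, 20) climbs a toy one-dimensional metric. In a real setup the candidate would be a code or prompt change and the revert would be a git operation.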
Scientific / research augmentation
- Karpathy’s autoresearch could make scientists of us all - Podcast: Azeem Azhar explores applying Karpathy’s loop architecture to refine worldviews, test arguments, and solve business problems beyond ML, arguing the pattern makes autonomous scientific experimentation cheap and accessible.
- Makoto Tanji on Karpathy’s AutoResearch - X: explains Karpathy’s autoresearch as an evolutionary search loop and highlights emerging multi-agent and parallel extensions.
- Duy Nguyen on Karpathy, Tobi, and generalized autoresearch loops - X: summarizes autoresearch as edit → evaluate → keep/discard → repeat and points to its use in both GPT training and Shopify Liquid optimization.
- Darian Parrish on smaller autoresearch loops for non-coding tasks - X: mentions adapting the autoresearch pattern beyond coding into other task types.
- Ilya on adapting autoresearch to energy-demand peak prediction - X: describes packaging a custom scorer and composite metric so Claude Code could autoresearch seasonal calibrations and threshold interactions for energy demand prediction.
- Google hit: “I built an autonomous ML agent that runs experiments on tabular data indefinitely” - Reddit: discussion thread describing a Claude Code setup that applies autoresearch loops to churn and conversion tabular tasks by iterating on features, models, and hyperparameters until gains plateau.
- Is ‘autoresearch’ adaptable to LoRA training? - Reddit: discussion thread exploring whether Karpathy’s autoresearch loop can be applied to Stable Diffusion LoRA fine-tuning with image-quality evaluation as the keep/discard metric.
- Gemma 4 Uncensored (autoresearch results) - Reddit: describes an automated research loop that ran 22 Gemma 4 uncensoring experiments, fixed false-positive refusal markers, and escalated from dense-only abliteration to expert-granular MoE edits when results stalled.
- Dan Woods on using autoresearch plus Apple’s “LLM in a Flash” to run Qwen3.5-397B locally - X: says Claude Code used Karpathy’s autoresearch setup plus Apple’s “LLM in a Flash” paper to get Qwen3.5-397B running on a 48 GB M3 Max MacBook.
- Joe Harris on building the same autoresearch idea for robotics teams - X: claims his team built an analogous loop for robotics teams to automate experiment plumbing, debugging, and iteration around harder physical-world workflows.
- David Gasquez on using the autoresearch idea in ML competitions - X: says the benchmark-driven autoresearch pattern worked in a couple of ML competition settings and generalizes to scored tasks like retrieval, AUC, and performance tuning.
- Dylan Huang on a 108-experiment autoresearch run for golf forecasting - X: reports letting Claude Code run 108 no-human-loop experiments on Tinker to build a golf forecasting system, cutting held-out tournament log-loss from 2.81 to 0.54 while reverting 52% of trials.
- Applying Karpathy’s autoresearch to a 33M-token public transit dataset - Reddit: reports a 14% improvement on a 33-million-token public transit dataset using Karpathy-style autoresearch with replication notes.
Infra / benchmarking ideas
- Versur on bringing autoresearch-style loops to Grasshopper solver workflows - X: describes using candidate generation, fixed benchmarks, scoring, and keep-only-if-improved loops for computational design experiments.
- Google hit: “Autoresearch-style framework for improving heuristics” - Reddit: discussion thread about applying autoresearch-style benchmarked improvement loops to optimization heuristics under strict solver budgets.
- Google hit: HN thread on applying autoresearch to LLM inference - Hacker News: discussion pointing to autoresearch-style ideas being adapted from model training to LLM inference optimization.
- From traces to self-improving agents - Blog: describes Metronis Aegis as a closed-loop agent-improvement stack where traces become evals, weak slices become RL environments, useful trajectories write back to memory, and release depends on held-out proof.
- AutoResearch vs Classical Hyperparameter Tuning - Blog: reports a NanoChat head-to-head where autoresearch beats Optuna on sample efficiency, cost-adjusted results, and longer-horizon generalization by escaping a fixed search space.
- Autoresearch Hub - Hacker News: Karpathy describes an unreleased swarm design where trusted workers verify improvements from a larger untrusted pool to parallelize autoresearch with leaderboard-style proof of progress.
- Greyforge audit of autoresearch architecture - GitHub Discussion: Greyforge Labs publishes a systems audit concluding autoresearch is a clean demo loop but not a serious research architecture, identifying gaps in supervision, routing, memory discipline, artifact durability, and operator control that they are building further up the stack.
- Andrej Karpathy on AutoResearch, AI Agents, and Why He Stopped Writing Code (No Priors Interview) - Podcast: Karpathy describes the “loopy era” of autonomous systems running experiments, training models, and self-improving without human coding, including details on how he uses autoresearch for 16 hours a day.
- Autoresearch-Style Autonomous Prompt Optimization for PlanExe - Proposal: adapts Karpathy’s autoresearch pattern to autonomous prompt-template optimization, mutating one template at a time, regenerating affected outputs, and scoring results overnight.
- autoresearch on Loooom: Teaching a Skill to Improve Itself - Blog: dogfoods the autoresearch pattern on the learn-anything skill, lifting it from 0% to 87.5% eval pass rate in two iterations and identifying that deterministic evals, golden responses, and eval versioning are critical for LLM-based keep/discard loops.
- Autoresearch: The Overnight Loop That Changed the Production Function - Blog: synthesizes the autoresearch cascade from Karpathy’s original loop through financial markets, chess, and rendering, identifying the fixed clock, honest fitness function, and cost convergence as the structural conditions behind the pattern’s rapid spread across domains.
- 拆解 AutoResearch:630 行代码,一晚上百次实验 - Blog (Chinese): a technical deep-dive into Karpathy’s 630-line autoresearch codebase, explaining the fixed-time loop, three-file architecture, and why minimalist design enables overnight autonomous experimentation.
- How to stop your autoresearch loop from cheating - Blog: Cerebras reports an overnight loop that abandoned the intended experiment and started its own, surfacing evaluator gaming as a first-order failure mode when the agent can read the evaluation setup.
- Autoresearch and Context Rot — How a Stateless Agent Loop Avoids Memory Problems - Blog: analyzes how autoresearch’s stateless git keep/revert design sidesteps context-window rot and identifies where it still breaks when experiments require long-horizon reasoning.
- The Karpathy Loop: 700 experiments, 2 days, and a glimpse of where AI is heading - Fortune: mainstream coverage of Karpathy running 700 autonomous experiments over two days, framing the keep/discard pattern as a shift in how AI research gets done.
- koylanai on building an autonomous autoresearch loop for Agent Skills and Context Engineering - X: describes building a Karpathy-inspired autonomous research loop that discovers sources, scores them with rubrics, and drafts skill updates for context engineering, directly referencing autoresearch and identifying where repos need stronger validation.
- AutoResearch 发布满月:社区如何将其从“最小闭环”进化为“通用科研框架” - Blog (Chinese): reviews how the community evolved Karpathy’s minimal loop into a multi-agent, cloud-deployable general research framework within one month of release.
- The Autoresearch Ecosystem: How One Repo Spawned 9 Different Types of AI Projects - Blog: maps the autoresearch ecosystem into nine distinct project categories and traces how one 630-line script generated an entire optimization toolchain.
- The Loop Is Only as Good as the Metric - Blog: argues that autoresearch success hinges on metric design and shows how weak metrics produce spurious wins, useless outputs, or infinite loops.
- Ran autoresearch with and without access to 2M CS papers - Reddit: a controlled experiment where the agent with paper access found techniques outside Claude’s training data, suggesting knowledge retrieval as a strong complement to keep/discard loops.
- Auto-RAG-Optimizer: Applying AutoResearch to Autonomous RAG Tuning - Blog: applies autoresearch to RAG pipeline tuning by editing retrieval parameters and keeping only configurations that lift answer accuracy against a fixed eval set.
- Multi-Agent Autoresearch: Automating ML Optimization with Open Source AI - Blog: describes a multi-agent autoresearch setup where specialized agents handle proposal, evaluation, memory, and orchestration roles across ML optimization tasks.
- Autoresearch System Architecture for ML Claim Validation - Wiki: proposes a system architecture that uses autoresearch to validate ML research claims by having agents attempt to reproduce and stress-test reported results in controlled loops.
- Adapted autoresearch for adversarial protocol hardening - GitHub Discussion: applies the autoresearch pattern to finding invariant violations in a cryptographic delegation protocol, treating the agent as an adversarial tester that proposes protocol changes and keeps only those that break security guarantees.
- Autoresearch on Steroids with Sandboxes - Blog: demonstrates using sandboxed environments as the execution layer for autoresearch loops with isolation, parallelism, and pre-warmed filesystems to run candidates faster and more reliably.
- AI4Scientist/awesome-autoresearch - GitHub curated list: another community-maintained list of autonomous researcher frameworks with 105+ stars, signaling the ecosystem has grown large enough to sustain multiple curated indexes.
Software / code workflows (continued)
- Autoresearch Is Reward Function Design - Blog: applies pi-autoresearch to optimize a Python finance scoring pipeline across 49 experiments with keep/discard, cutting p95 latency from 339ms to 34ms and arguing that autoresearch success hinges on reward-function design quality rather than the loop mechanics.
Infra / benchmarking ideas (continued)
- Auto-research: The Lab That Runs While You Sleep - Blog: discusses how automated autoresearch feedback loops shift the bottleneck from running experiments to deciding what to optimize for, arguing that metric definition is the binding constraint on loop effectiveness.
- Where Autoresearch Breaks (and Why That’s the Point) - Blog: argues that cheap verification (single-number comparison) is the core mechanism making autoresearch viable, and identifies which domains break when verification requires expensive human judgment.
- Research Worlds: governing long-running autoresearch agents - GitHub Discussion: proposes a “world file” pattern that defines the research environment (datasets, metrics, constraints, budgets) so agents stay governed even when context drifts across hours or days.
- Autoresearch & Self-Improving Agent Loops: Community Research Report - Research report: compiles community findings from YouTube, Reddit, and GitHub Discussions on Karpathy’s autoresearch and self-improving agent loops as of April 2026.
- MinLiBuilds on Anthropic’s Outcomes feature as autoresearch engineering - X: notes that Anthropic’s new Outcomes (Codex /goal) feature is essentially autoresearch engineered into the product, signaling mainstream coding-agent platforms adopting the keep/discard loop pattern natively.
- Tobi Lütke on local models phoning a stronger advisor for autoresearch - X: observes that local models can run pi-autoresearch effectively when they occasionally ask a stronger model for strategy ideas, inspiring the phone-a-friend pattern for multi-model optimization loops.
- TokenTelemetry — Track token cost and reasoning traces across autoresearch runs - GitHub Discussion: introduces a 100% local, open-source tool for tracking token costs and reasoning traces across autoresearch runs, surfacing expenditure patterns to inform cost-aware experiment design.
- Is autoresearch really better than classic hyperparameter tuning? - Reddit: discussion comparing Karpathy-style autoresearch loops with classical hyperparameter tuning methods on the nanochat benchmark, examining whether code-editing agents genuinely outperform structured HPO.
Knowledge Base / RAG Preparation
Source file: categories/knowledge-base-rag-preparation.md
- autoresearch-genealogy - Genealogy: uses Claude Code /autoresearch prompts to expand family trees, verify claims against multiple sources, and keep a structured evidence-backed research vault.
- AutoRAGsearch - RAG retrieval optimization: applies an autoresearch-style loop to a fixed QA benchmark by editing only rag_pipeline.py, running local retrieval experiments, and improving retrieval_score from 0.9472 to 0.9867 over 20 autonomous experiments.
- claude-obsidian - Knowledge-base preparation: uses a Karpathy-style /autoresearch skill to gather sources, fill research gaps across three search rounds, and write structured source, concept, entity, and synthesis pages into a retrieval-ready Obsidian vault.
- AutoRAG Research Lab - RAG pipeline optimization: runs autonomous agents that plan, implement, benchmark, and keep/discard retrieval pipeline changes using Karpathy-mode over EnterpriseRAG-Bench with Docker-sandboxed evaluation and semantic no-op detection.
Market Research
Source file: categories/market-research.md
- atlas-gic - Trading: applies Karpathy-style autoresearch to a swarm of market agents, rewriting the worst-performing prompts and keeping changes only when rolling Sharpe improves.
Workflow Automation
Source file: categories/workflow-automation.md
- AutoResearchClaw - Scientific research: turns a research idea into a paper through a fully autonomous multi-stage pipeline with self-healing experiments and pivot/refine loops.
- Claudini - AI safety research: uses an autoresearch-style loop to invent and benchmark new LLM attack algorithms, keeping only methods that outperform baselines.
- AutoKernel - GPU optimization: applies Karpathy-style autoresearch to kernel bottlenecks, iterating on code, benchmarking, and keeping only changes that improve speed without breaking correctness.
- autovoiceevals - Voice AI evaluation: attacks voice agents with adversarial callers, proposes prompt changes one at a time, and keeps or reverts them based on eval results.
- PM document optimizer - Product workflow automation: applies a Karpathy-style git ratchet to markdown artifacts like PRDs and strategy docs, scoring each draft with programmatic checks and committing only higher-scoring revisions.
- Trip Optimizer Pro - Travel planning workflow automation: applies the autoresearch pattern to itinerary generation by researching destinations, scoring multi-day plans, and keeping only itinerary mutations that improve a weighted travel-quality score.
- Autoresearch for Software Development - Software delivery automation: adapts autoresearch to GitHub Issues by rotating agents through implement → review → fix loops, then auto-merging only issues that clear a score gate plus build, lint, and test checks.
- autobrowse - Browser workflow automation: applies an autoresearch loop to browser tasks by iterating on strategy.md, replaying tasks with Browserbase, and graduating only reliable workflows into reusable Claude Code skills.
- Google Play Description Autoresearcher - ASO optimization: applies a Karpathy-style autoresearch loop to Google Play Store long descriptions, running ~25 keyword-density experiments per session and committing only description variants that improve keyword match scores.
Submission format
Use exactly one line per entry:
- [Name](URL) - Industry: one-sentence description of the autoresearch use case.
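The one-line format above can be checked mechanically before submitting. The sketch below is one reading of that format, not the repository's own tooling: the regex, the http/https requirement, and the parse_entry name are all assumptions, and CONTRIBUTING.md may impose stricter rules.

```python
import re

# Assumed shape: "- [Name](URL) - Industry: description" on a single line.
ENTRY_RE = re.compile(
    r"^- \[(?P<name>[^\]]+)\]\((?P<url>https?://[^\s)]+)\) - "
    r"(?P<industry>[^:]+): (?P<description>\S.*)$"
)


def parse_entry(line):
    """Return the entry's fields as a dict, or None if the line does not match."""
    match = ENTRY_RE.match(line)
    return match.groupdict() if match else None
```

Calling parse_entry on each new line of a category file and rejecting None results would catch a missing link, a wrong bullet marker, or a missing industry label. It cannot judge whether the description is concrete or limited to one sentence.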
How to contribute
- Pick the category that best matches the direct autoresearch application domain.
- Add a single-line entry in the required format to the category file, not directly to the README aggregate.
- Keep the summary concrete and scannable.
- Prefer examples that clearly show scenario + autoresearch loop + value.
See CONTRIBUTING.md for details.
License
MIT