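@yibie: awesome-autoresearch update, 7 new entries. pi-lifeline - a very practical pi extension that lets a small local model run autoresearch and automatically ask a stronger model for help when it gets stuck at a bottleneck. Tobi Lütke also mentioned this on X…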

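Summary

The awesome-autoresearch list has been updated with 7 new entries, including pi-lifeline (a small local model asks a stronger model for help), a 12x improvement in Marketing Mix Modeling, and the open-source tracking tool TokenTelemetry, covering autoresearch use cases across multiple industries.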

Cached at: 2026/05/20 06:26

awesome-autoresearch has been updated with 7 new entries.

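pi-lifeline - a very practical pi extension: it lets a small local model run autoresearch and automatically asks a stronger model for help when it gets stuck at a bottleneck. Tobi Lütke also mentioned this “phone-a-friend” pattern on X.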
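
The “phone-a-friend” pattern can be sketched roughly as follows. This is a minimal illustration, not pi-lifeline's actual code: `run_with_lifeline`, the `local` and `strong` callables, and the `patience` threshold are hypothetical names, and "stuck" is assumed to mean a fixed number of consecutive non-improving attempts. For brevity each model is reduced to a function that takes the current best score and returns a candidate score (lower is better).

```python
from typing import Callable, Tuple

# Hypothetical interface: given the current best score, return a candidate score.
Proposer = Callable[[float], float]


def run_with_lifeline(
    local: Proposer,
    strong: Proposer,
    baseline: float,
    iterations: int,
    patience: int,
) -> Tuple[float, int]:
    """Run a keep/discard loop with the local model, escalating when stalled.

    Returns the best score found and the number of escalations to `strong`.
    """
    best = baseline
    stalled = 0       # consecutive attempts without improvement
    escalations = 0

    for _ in range(iterations):
        if stalled >= patience:
            # Assumed trigger: the local model has plateaued, so ask for help.
            candidate = strong(best)
            escalations += 1
            stalled = 0
        else:
            candidate = local(best)

        if candidate < best:
            best = candidate   # keep the improvement
            stalled = 0
        else:
            stalled += 1       # discard and count the stall

    return best, escalations
```

Counting escalations separately makes it easy to see how much of the run relied on the more expensive model.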

Marketing Mix Modeling 12x - someone applied autoresearch to marketing mix modeling, iterating on model components until the final lift was 12 times higher than Google Meridian's. See karpathy/autoresearch Discussion #497.

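automat - materials science research. The agent must give a scientific rationale for every materials descriptor it proposes before running experiments to validate it. This is not black-box parameter tuning; it forces the agent to argue from domain knowledge.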

TokenTelemetry - an open-source tool that tracks token consumption and reasoning traces across autoresearch runs, making experiment costs visible.

Also added: autoresearch-gym (robotics RL), autoresearch-skill-Andrej-Karpathy (76% latency optimization), Tobi's phone-a-friend discussion, and others.

234 implementations + 111 discussions = 345 entries


yibie/awesome-autoresearch

Source: https://github.com/yibie/awesome-autoresearch

awesome-autoresearch

A curated awesome list of public autoresearch use cases across industries.

This README is the homepage aggregate of the current category files, so the latest accepted entries are visible here without drilling into subpages.

The repository distinguishes between:

  • primary categories for stronger case evidence such as repos, project pages, and concrete write-ups
  • secondary overlap categories for cross-cutting patterns that reuse the same evidence from another angle
  • Related Practices / Discussions for credible public practice signals — especially X threads, Reddit discussions, and interviews — that describe real autoresearch usage even when no strong standalone case page exists yet.

Why this list

Most discussions about autoresearch are still scattered, vague, or overly tool-centric. This list is designed to answer two practical questions quickly:

  • Where has autoresearch already been used in real workflows?
  • Which patterns can transfer across industries?

This is not a comprehensive database. It is a high-signal, fast-scanning field guide.

Inclusion criteria

An entry should meet all of the following:

  • The source is public and citable.
  • The example is directly related to autoresearch, not just a generic research or monitoring agent.
  • The source explicitly mentions autoresearch, cites Karpathy’s autoresearch, or clearly shows a modify → verify → keep/discard → repeat loop.
  • The summary explains the scenario, method, and value in one sentence.

We do not include:

  • Generic research agents, monitoring agents, or multi-agent systems with no explicit autoresearch loop.
  • Pure theory or opinion without a concrete practice.
  • Generic AI commentary with no autoresearch workflow.
  • Long write-ups inside the list itself.
  • Sources that are private, inaccessible, or too vague to classify.
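
The modify → verify → keep/discard → repeat loop named in the criteria above can be sketched as a simple ratchet. This is only an illustrative sketch under stated assumptions: `autoresearch_loop`, `propose`, and `evaluate` are hypothetical names, the state is any value the caller chooses, and a higher score is assumed to be better.

```python
from typing import Callable, List, Tuple, TypeVar

S = TypeVar("S")


def autoresearch_loop(
    state: S,
    propose: Callable[[S], S],
    evaluate: Callable[[S], float],
    rounds: int,
) -> Tuple[S, float, List[Tuple[int, float, str]]]:
    """Keep a candidate only when it strictly beats the best score so far."""
    best_score = evaluate(state)
    ledger: List[Tuple[int, float, str]] = []

    for i in range(rounds):
        candidate = propose(state)       # modify
        score = evaluate(candidate)      # verify against a fixed evaluator
        kept = score > best_score
        if kept:
            state, best_score = candidate, score   # keep
        # On discard, `state` is simply left unchanged (the "revert").
        ledger.append((i, score, "keep" if kept else "discard"))

    return state, best_score, ledger
```

The ledger plays the role that a results file or commit history plays in many of the listed projects: every attempt is recorded, whether or not it survives.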

Current coverage

Primary categories

Secondary overlap categories

Open categories still being tracked

Some entries intentionally appear in more than one overlap category when the same project is both a domain case and a reusable workflow pattern.

Browse by category

Full list

Scientific Research

Source file: categories/scientific-research.md

  • AutoResearchClaw - Scientific research: turns a research idea into a paper through a fully autonomous multi-stage pipeline with self-healing experiments and pivot/refine loops.
  • Sibyl Research System - Scientific research: builds a fully autonomous AI scientist on Claude Code with inner research-iteration loops and outer self-evolution across projects.
  • autoresearch-rl - RL research: applies the autoresearch pattern to RL post-training by iterating on one training config, running fixed-time experiments, and keeping only eval improvements.
  • autoresearch-robotics - Robotics research: adapts Karpathy-style autoresearch to MuJoCo and Gymnasium robotics tasks by editing one training file, evaluating fixed-budget runs, and using simulator renderings plus vision feedback to keep only better policies.
  • Tinker-Explorer - Evidence-retrieval research: adapts the autoresearch pattern to GRPO document exploration, comparing reward designs and keeping only retrieval policies that answer multihop questions more accurately under a token budget.
  • Autoresearch on an old research idea - Multimodal retrieval research: applies Claude Code autoresearch to an old eCLIP idea, running 42 fixed-budget experiments with commit/revert decisions and cutting mean rank from 344.68 to 157.43.
  • autoresearch-at-home - Distributed ML research: coordinates a SETI@home-style swarm of agents that claim experiments, share full train.py results through Ensue, and collectively drive down val_bpb across different GPUs.
  • autoresearch-paper-benchmark - Graph ML research: runs paper-driven campaigns on a fixed Peptides-func benchmark by editing train.py, logging 300-second experiments, and testing only the best validation-AP model at campaign end.
  • autoresearch-cifar10 - Vision research: applies autoresearch to CIFAR-10 ResNet training on a 3090, iterating under fixed time budgets and keeping changes that lift accuracy beyond a 91.89% baseline.
  • autoresearch-yolo - Vision research: adapts Karpathy’s one-file keep/revert loop to YOLOv8 on COCO128, editing only train.py, running 10-minute experiments, and raising mAP50-95 from 0.3820 to 0.8303 in the published results ledger.
  • AutoResearch-GenPose - Vision research: adapts autoresearch to CIFAR-10 UNet denoising by editing one training file, running fixed 5-minute experiments, and keeping only val_psnr improvements.
  • MLP-AutoResearch - MNIST training research: ports Karpathy’s single-file loop to an MLP classifier, fixing 20-epoch runs and greedy keep/revert decisions that raised handwritten-digit accuracy from 0.9809 to 0.9836.
  • autoresearch-medimage - Medical imaging research: adapts Karpathy’s prepare.py + train.py + results.tsv loop to 2D imaging tasks, using short-budget candidate discovery and staged follow-up validation to surface stronger ChestXray14 models.
  • Agentic MR sequence development - MRI sequence research: uses Agent4MR with physics-aware validation to let agents generate, refine, and autoresearch PyPulseq sequences for a fluid-suppressed spin-echo EPI challenge across model generations.
  • autocircuit - Analog circuit optimization: adapts Karpathy’s autoresearch to a SKY130 two-stage op-amp, editing optimize.py, running ngspice, and keeping only parameter changes that expand the GBW-versus-power Pareto front under phase-margin constraints.
  • Photonic Device Auto-Design Agent - Photonic design research: applies a Karpathy-inspired loop to one design.py, running layout previews, fabrication DRC, FDTD simulations, and keep/discard decisions to optimize devices such as splitters, tapers, and crossings.
  • fe-autoresearch - Tabular ML research: applies the autoresearch loop to LightGBM feature engineering on the UCI Bank Marketing dataset by editing one engineer_features() target, training against fixed AUC metrics, and keeping only improvements.
  • AutoFeaTune - Tabular ML research: edits only features.py, trains XGBoost under 5-fold CV composite scoring, and uses a git ratchet plus graph memory to keep or revert feature transforms over repeated autoresearch runs.
  • Paper Lantern improves Autoresearch - ML research augmentation: connects a 2M-paper MCP server to autoresearch, letting the agent cite 100 papers across 100 experiments and reach a 3.2% lower 2-hour validation loss than the same run without paper access.
  • Subtractive Search in a Mature Tabular Pipeline - Tabular ML research: applies Karpathy’s autoresearch to a churn-prediction XGBoost pipeline, running 116 autonomous experiments and lifting subsample AUC from 0.902892 to 0.916721 largely by removing noisy target-encoded features.
  • autoresearch-connect4 - Game AI research: adapts Karpathy’s three-file autoresearch loop to Connect Four by editing train.py, training 5-minute self-play runs, and keeping only changes that improve weighted win rate against fixed opponents.
  • autoresearch-tabular - Tabular ML research: adapts Karpathy’s three-file loop to the Adult Income benchmark by editing only train.py, running fixed 2-minute experiments, and keeping only val_auc improvements.
  • ocr-autoresearch - OCR research: adapts autoresearch to ICDAR2015 scene-text recognition by editing one train.py, running fixed 5-minute CRNN+CTC experiments, and keeping only lower validation character error rates.
  • xgboost-autoresearch - XGBoost model optimization: provides a reusable template and a concrete end-to-end example that autonomously iterates on feature engineering and XGBoost hyperparameters for tabular datasets, actively searching the web for domain knowledge and ML best practices to inform each experiment, and delivering sustained AUC improvements across successive iterations.
  • Tennis XGBoost Autoresearch - Sports analytics research: applies a Karpathy-style keep/revert loop to a 245K-match tennis XGBoost pipeline, then hardens the evaluator after the agent learned to game mutable ROC-AUC scoring.
  • Bio-Autoresearch - Drug discovery research: applies a Karpathy-style autoresearch loop to rare-disease drug repurposing on PrimeKG, running 15 GPU experiments with keep/revert decisions and lifting held-out per-disease AUPRC from 0.284 to 0.761.
  • autoresearch-quantum - Quantum research: runs incumbent/challenger autoresearch ratchets for encoded magic-state experiments, screens candidates on cheap noisy simulations, and promotes only justified challengers to expensive backends while logging transferable lessons.
  • kaggle-autoresearch - Tabular ML research: adapts Karpathy-style autoresearch to Kaggle competitions such as Titanic, House Prices, and Store Sales by iterating on feature and model code, logging approved baselines, and accepting only cross-validation improvements over fixed thresholds.
  • MiniMax M2.7: Early Echoes of Self-Evolution - AI-lab research: describes an internal research agent that automates 30%-50% of RL workflow and a 100+ round keep/revert scaffold-optimization loop that improved internal evaluation scores by 30%.
  • autoresearch-macro - Macroeconomic forecasting research: runs LLM-guided outer-loop search over Chronos-2 covariates, transforms, and fine-tuning settings, keeping only validation-era forecast improvements across pseudo-real-time Norway, Canada, and Sweden benchmarks.
  • autoresearch-dqn - RL algorithm research: applies the autoresearch loop to a CartPole training script, logging 39 iterations that replaced an unstable DQN baseline with a REINFORCE agent that reaches reward 500 in about 5 seconds instead of about 3 minutes.
  • AutoMedal - Kaggle competition research: adapts Karpathy’s keep/revert loop into strategist, researcher, and experimenter phases, journaling 24 tabular-competition experiments and keeping only lower val_loss changes on a fixed leaderboard-oriented harness.
  • autoresearch-qwen - Document VQA research: adapts Karpathy’s keep/discard loop to Qwen3-VL on the official DocVQA benchmark by fixing evaluate.py, limiting edits to train.py, and accepting only higher full-validation ANLS scores.
  • Wine Variety Prediction with LLMs - ML fine-tuning research: uses a Karpathy-inspired agent loop to edit one LoRA YAML config, parse HPSEARCH_RESULT accuracy from short MLX runs, log each trial in JSONL, and revert worse hyperparameter changes on the wine-classification benchmark.
  • Autoresearch for Research - Physics research: uses an LLM to hypothesize, implement, run, and interpret PDE experiments across five domains (Navier-Stokes, Kuramoto-Sivashinsky, Gray-Scott, NLS, CGLE), achieving 7-24× score improvements with no problem-specific training.
  • ScaleAutoResearch-Ramsey - Mathematics research: scales autoresearch to Ramsey number computation, discovering new lower bounds R(3,17)≥93 and R(4,15)≥160 that improve on 32-year-old records even DeepMind’s AlphaEvolve could not beat.
  • bountyhunter: Vesuvius Autoresearch - Scientific research: deploys the first autonomous research swarm for the Vesuvius Challenge, running a full CV research loop with hypothesis generation, hyperparameter optimization, model training, and keep/discard decisions to uncover optimal configurations for reading ancient carbonized scrolls.
  • AutoGo - Game AI research: applies Karpathy’s autoresearch to training a Go-playing AI from scratch, using cheap self-play data and fixed-time eval to keep only network architecture and training improvements across iterations.
  • autoresearch-speedrun - NanoGPT speedrun research: applies Karpathy-style autoresearch with a two-stage paired-seed acceptance funnel and a curated literature-backed idea pool, cutting val_loss from 3.9249 to 3.8093 across 30 experiments, with 6 accepted changes and ~24 honest discards.
  • autoresearch-gym - Robotics RL research: provides a Gymnasium workbench for agent-driven RL research loops under fixed benchmark contracts with live dashboard and deterministic evaluation, following the Karpathy autoresearch pattern with candidate training and metric inspection.
  • automat - Materials science research: applies Karpathy’s autoresearch to design composition-only descriptors for materials regression tasks, requiring the agent to justify scientific reasoning behind each proposed descriptor before evaluating against held-out splits.
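
Several entries above accept a change only after comparing it with the baseline under matched conditions. One way a two-stage paired-seed acceptance check might look is sketched below. It is an assumption about the general idea, not the procedure of any listed project: `paired_seed_accept`, the seed lists, and `min_gain` are hypothetical, and lower loss is assumed to be better.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical interface: run one experiment with a given seed, return its loss.
Run = Callable[[int], float]


def paired_seed_accept(
    run_baseline: Run,
    run_candidate: Run,
    screen_seeds: Sequence[int],
    confirm_seeds: Sequence[int],
    min_gain: float,
) -> bool:
    """Accept a candidate only if it survives a cheap screen and a confirmation."""

    def paired_gain(seeds: Sequence[int]) -> float:
        # Same seed for both arms, so seed-to-seed noise cancels in the difference.
        return mean(run_baseline(s) - run_candidate(s) for s in seeds)

    # Stage 1: cheap screen on a few seeds; reject anything that is not ahead.
    if paired_gain(screen_seeds) <= 0:
        return False

    # Stage 2: confirm on fresh seeds with a minimum required gain.
    return paired_gain(confirm_seeds) >= min_gain
```

Using disjoint seeds for the two stages keeps a lucky screening run from also deciding the confirmation.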

Software / Systems Optimization

Source file: categories/software-systems-optimization.md

  • karpathy/autoresearch - ML training optimization: the original autoresearch loop where an agent edits a GPT training script, runs fixed-time experiments, and keeps only improvements in validation bits-per-byte.
  • AutoKernel - GPU optimization: applies Karpathy-style autoresearch to kernel bottlenecks, iterating on code, benchmarking, and keeping only changes that improve speed without breaking correctness.
  • autoresearch-webgpu - Browser ML optimization: ports Karpathy’s autoresearch into the browser so agents can generate training code, run GPU-backed experiments, and feed losses back into the next iteration.
  • autoresearch-local-llm - Local ML optimization: replaces Claude Code with a local Qwen model to run the standard autoresearch keep/revert loop on a shared single GPU.
  • Shopify Liquid performance work via autoresearch - Software optimization: Tobi Lütke applied an autoresearch loop to Shopify’s Liquid template engine, producing 93 automated commits that improved parse+render performance by 53% with 61% fewer allocations.
  • Autoresearch for SAT Solvers - SAT solver optimization: runs parallel MaxSAT experiments, updates reusable solver code plus expert memory, and improves public benchmark configurations against 2024 competition baselines.
  • autoresearch — Heuristic CP Edition - Heuristic solver optimization: adapts autoresearch to C++ competitive-programming solvers by editing only solver.cpp, scoring fixed benchmark instances, and keeping only lower average solution costs.
  • Autoresearch for game development - HTML5 game development: runs agents that build better games from player feedback and usage metrics, benchmarking them by game Elo in one-on-one matchups.
  • SiliconSwarm@Ensue - Apple Silicon inference optimization: uses a multi-agent autoresearch loop to test ANE graph changes across chips and reports up to 6.31× lower median DistilBERT latency than CoreML.
  • Flash-MoE - Apple Silicon inference optimization: uses a Claude Code autoresearch loop to run 43 Metal optimization experiments on Qwen3.5-397B and reach 20.34 tok/s on an M5 Max by overlapping SSD reads with GPU compute.
  • Research-Driven Agents: When an agent reads before it codes - LLM inference optimization: extends Karpathy’s autoresearch with a literature-review phase that reads papers and competing forks before parallel llama.cpp experiments, landing five kernel fusions and about 15% faster x86 flash-attention generation in about 3 hours.
  • Rails controller tuning with Claude Code /loop autoresearch - Backend performance optimization: adapts Karpathy’s keep/discard loop to Rails controller latency by locking benchmark scripts and test data, running 10-minute cycles, and auto-reverting regressions.
  • Pytest speedups via autoresearch feedback loops - Test performance optimization: applies autoresearch to a backend pytest suite with a fixed evaluation harness, seven autonomous experiments, and a 295s → 71s keep/discard improvement path.
  • autoresearch-sudoku - Solver optimization: uses an enhanced autoresearch loop to rewrite a Rust sudoku solver over 312 experiments and beat Tdoku plus rust_sudoku on 4 of 6 standard benchmark datasets.
  • autospec - Backend service generation: applies an autoresearch-inspired keep-or-revert loop to natural-language business rules, iteratively building a Spring Boot service until Gradle and JUnit evaluation pass without regression.
  • How I used autoresearch to fix Gumroad’s flaky tests in a week - Test reliability: uses OpenClaw autoresearch to run 206 commits and 94 CI cycles that fixed 13 flaky tests while surfacing a real file-ID remapping bug.
  • WinMoE - Windows inference optimization: uses an AI-driven autoresearch methodology with one-change measurements and keep-or-reject ledgers to lift Qwen3.5-397B throughput from 0.44 to 1.9 tok/s on consumer hardware.
  • ZK Autoresearch — Plonky3 DFT Optimizer - ZK prover optimization: applies Karpathy’s autoresearch pattern to Plonky3’s DFT code, running Rust tests plus Criterion benchmarks and keeping only commits that reduce coset_lde_batch time on BabyBear field workloads.
  • autoresearch-go-ane - Apple Silicon training optimization: ports Karpathy’s loop to a Go plus ANE LLM trainer, benchmarking fixed 5-minute TinyStories runs with benchstat and keeping only lower val_loss configurations.
  • openroad-autoresearch-ibex - Chip design optimization: applies a fixed-harness autoresearch loop to OpenROAD RTL-to-GDSII experiments on the IBEX CPU, using scout-promote screening and objective-aware history to keep only timing, area, or power improvements.
  • OpenCLI - Browser automation reliability: adds a Karpathy-style autoresearch harness to OpenCLI, cycling review → modify → commit → verify → decide against fixed V2EX, Zhihu, browser, and save-as-CLI eval suites to keep only reliability improvements.
  • autoresearch-cublas-sam3 - GPU kernel optimization: applies an autoresearch loop to SAM3 GEMM tuning by mutating one config at a time, benchmarking on real GPUs, and keeping only throughput wins, for a 2.14% gain over 120 experiments on an RTX 3090.
  • autoresearch-mamba - Mamba training optimization: adapts Karpathy’s fixed-evaluator, 5-minute keep/discard loop to MLX Mamba-2, Mamba-3, and hybrid Mamba-Transformer models on Apple Silicon by mutating one training surface to lower val_bpb.
  • liltrAIner - Local LLM fine-tuning optimization: applies a Karpathy-style autoresearch loop to MLX LoRA runs on Apple Silicon, letting an agent mutate training data or config, score eval prompts, and keep or revert each fine-tuning experiment.
  • english-app - Education app optimization: applies an autoresearch-inspired proposer → implement → test → evaluate → keep/discard loop to an English learning app, using pytest, TypeScript checks, and smoke tests to keep only changes scoring at least 6.0 across 10 autonomous iterations.
  • How we built the best browser agent with Auto-Research - Browser automation optimization: uses parallel Claude Code auto-research loops against Online-Mind2Web, running 20-cycle harness edits with train/validation splits and reaching 97% on the benchmark while rejecting task-specific overfits.
  • Speed up code with pi-autoresearch - Software performance optimization: applies pi-autoresearch to jsonista’s JSON decoding benchmark, keeping only measured wins and lifting one selected benchmark’s throughput by 56% while surfacing overfitting risks in accepted diffs.
  • 588x Faster SQLite Ingestion With an Autoresearch Loop - ETL performance optimization: applies pi-autoresearch to a Python financial-data ingestion pipeline, benchmarking 50,000-row SQLite writes and keeping fixes that cut processing time from about 397s to 0.675s.
  • auto-data-pipeline-optimization - Data pipeline optimization: adapts Karpathy’s three-file keep/revert loop to data engineering, letting an agent mutate compression, partitioning, chunking, and query logic against a fixed efficiency score and achieving +11.3% improvement over 20 experiments.
  • nnmetal + labrat - Apple Silicon inference optimization: uses an autonomous Zig and Metal autoresearch loop that snapshots engine files, makes one kernel change at a time, runs compile, test, and benchmark gates, and commits only throughput or latency wins above a fixed threshold.
  • HashSmith, Part 3: I Automated My Way to a 27% Faster Hash Table - Data-structure performance optimization: uses a Claude Code auto-optimize loop to profile, benchmark, and keep only wins on a JVM SwissTable implementation, landing three accepted changes and 13%-32% gains across eight benchmark scenarios.
  • claude-code-bench - AI coding workflow optimization: applies Karpathy-style autoresearch to Claude Code’s 7-dimensional configuration space, running benchmark tasks and keeping only profiles that improve quality-adjusted scores for research depth, correctness, and convention adherence.
  • autooptimization - Systems optimization: applies a profile-first autoresearch protocol to codebases like ClickHouse, Chroma, DataFusion, and RocksDB, keeping only statistically benchmarked optimizations backed by stack-level profiling evidence.
  • helix-inference-opt - LLM inference optimization: applies a fixed 1-minute autoresearch benchmark to Qwen2.5-0.5B decoding on WikiText-2, rewriting only infer.py and keeping throughput gains only when bits-per-byte quality stays within a 1% guard.
  • autoresearch-inference-optimization - Inference serving optimization: lets an agent rewrite serve.sh plus experiment.yaml, benchmark OpenAI-compatible servers under throughput, latency, and memory constraints, and keep only higher-scoring serving configs in experiments.jsonl.
  • PolyTrader - Trading-system performance optimization: applies autoresearch to PolyTrader’s signal-detection hot path, keeping only test-clean code changes that cut end-to-end tracker latency from 25.7 ms to 0.46 ms across a published 10-iteration benchmark run.
  • autoresearch-lora-buzhou - Local LoRA fine-tuning optimization: adapts autoresearch to user-chosen LoRA training goals by establishing a confirmed baseline, changing one parameter at a time, rerunning >1% wins for confirmation, and promoting only verified val_loss improvements to the best checkpoint.
  • ZeroAPI - Model-routing optimization: applies an autoresearch loop to OpenClaw routing policy constants by rerunning evals on routing logs, changing one setting at a time, and keeping only traffic-verified improvements.
  • One Hour, 37% Faster: Applying Autoresearch to Our Search Ranking Inference Endpoint - Search ranking optimization: applies Karpathy’s edit → measure → keep/revert loop to idealo’s production LTR inference endpoint, reducing preprocessing latency 5.9× and end-to-end latency 37% in one hour for ~$7.
  • smallnest/autoresearch - Software development automation: generalizes Karpathy’s autoresearch into a multi-agent, multi-source (GitHub, local, Baidu iCafe, Codeup) issue-to-PR pipeline with rotator review, scoring gates, and automatic merge for any Git project.
  • arete autoresearch-compiler - Compiler optimization: applies a Karpathy-style keep/revert loop to a self-hosting Scheme compiler, benchmarking fixed workloads and keeping only changes that improve runtime performance.
  • auto-arch-tournament - CPU architecture optimization: extends Karpathy’s autoresearch loop to SystemVerilog RTL, proposing, implementing, and measuring microarchitectural hypotheses through riscv-formal, Verilator cosim, and FPGA place-and-route to achieve +91.9% CoreMark on a RISC-V core.
  • autoresearch-function - Function optimization: adapts Karpathy’s autoresearch to a CPU-only scaffold that benchmarks arbitrary functions against correctness, execution time, memory, and throughput metrics, keeping only edits that improve a composite score.
  • vllm-autoresearch - Inference serving optimization: applies Karpathy’s autoresearch to find optimal vLLM serving configurations for Llama 3.3 70B on 2× RTX 5090, editing one config at a time and benchmarking against throughput, latency, and noise-floor metrics with keep/discard decisions.
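
Many of the optimization entries above keep a speedup only when a quality or correctness guard still holds. A minimal sketch of such a gate follows, assuming a throughput metric where higher is better and a bits-per-byte quality metric where lower is better. The `Measurement` type, the `accept` function, and the default thresholds are hypothetical and not taken from any listed project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    tokens_per_sec: float   # throughput, higher is better
    bits_per_byte: float    # quality proxy, lower is better


def accept(
    baseline: Measurement,
    candidate: Measurement,
    quality_guard: float = 0.01,
    min_speedup: float = 0.0,
) -> bool:
    """Keep a candidate only if it is faster AND quality stays within the guard."""
    # Quality may degrade by at most `quality_guard` (1% by default).
    quality_ok = candidate.bits_per_byte <= baseline.bits_per_byte * (1 + quality_guard)
    # Throughput must beat the baseline by more than `min_speedup`.
    faster = candidate.tokens_per_sec > baseline.tokens_per_sec * (1 + min_speedup)
    return quality_ok and faster
```

Setting `min_speedup` above the measured benchmark noise is one way to avoid keeping changes that only look faster by chance.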

Evaluation / Red Teaming

Source file: categories/evaluation-red-teaming.md

  • Claudini - AI safety research: uses an autoresearch-style loop to invent and benchmark new LLM attack algorithms, keeping only methods that outperform baselines.
  • autovoiceevals - Voice AI evaluation: attacks voice agents with adversarial callers, proposes prompt changes one at a time, and keeps or reverts them based on eval results.
  • autoresearch-prompt-optimization - Prompt evaluation: applies the autoresearch loop to a fixed extraction benchmark, iteratively editing one prompt and keeping only accuracy gains on the eval set.
  • We Used Autoresearch on Our AI Skill, It Taught Us to Write Better Tests - AI skill evaluation: runs a prompt-migration skill against six fixed codebase test cases, scores each change on correctness, completeness, and efficiency, and keeps only improvements while cherry-picking around harness overfit.
  • AutoPrompter - Prompt evaluation: combines promptfoo-style metrics with autoresearch-style closed-loop iteration, generating datasets, testing target models, and refining prompts through a persistent experiment ledger.
  • AutonomousTester - UI testing evaluation: adapts autoresearch to Playwright test generation by editing only tests/test_suite.py, measuring coverage_score, and auto-fixing or discarding test changes until coverage improves.
  • Autoresearch for Agents from Scratch - Support-agent prompt evaluation: applies Karpathy’s keep/revert loop to system_prompt.md, scoring frozen adversarial support cases by tool-call accuracy and lifting the prompt from 0.05 to 0.80 over 15 experiments.
  • LLM Privacy + Cost Router — Classifier Experiment - Privacy classification evaluation: runs a Karpathy-style autoresearch experiment across regex and prompt variants for a hybrid LLM privacy classifier, validating the best configuration at 96.7% holdout accuracy with 4.6% false negatives.
  • AutoMemory - Agent memory evaluation: lets an agent rewrite its own memory system against LongMemEval, using an immutable evaluator over random question samples and iterating on code plus strategy notes in response to scored failures.
  • How to stop your autoresearch loop from cheating - Autoresearch evaluation hardening: reports 71 experiments across nanochat training and MoE compression, showing loops drift quickly unless experiments are isolated and evaluator gates block shortcut gains.
  • Autoreason - Output evaluation: extends Karpathy-style autoresearch to subjective writing and coding tasks by running incumbent-versus-revision-versus-synthesis tournaments under blind multi-judge Borda scoring and stopping only when the unchanged version wins twice, outperforming standard self-refinement baselines on writing tasks and 150 CodeContests problems.
  • AutoResearchBench - Evaluation benchmark: a dedicated benchmark for autonomous scientific literature discovery with complementary deep and wide research tasks, showing that even the most powerful LLMs achieve only ~9% accuracy on literature-focused open-ended search.
  • Jailbreak Autoresearch - Jailbreak red-teaming: applies Karpathy’s autoresearch loop to prompt-harness experiments, testing different header and footer harnesses against a fixed body and scoring responses against a rubric to find more effective jailbreak strategies.
  • JustAsk - System prompt extraction: uses autoresearch-style curiosity-driven code agents to iteratively probe and reveal system prompts from frontier LLMs; the work was accepted at ICML 2026.
  • DSBench AutoResearch - Benchmark evaluation: applies Karpathy-style hill-climbing to each of DSBench’s 74 Kaggle modeling and 38 Modeloff analysis tasks, running per-task autoresearch repos with a forensic audit and beating 82 of 112 DSBench baselines.
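
For subjective tasks, the Autoreason entry above describes multi-judge Borda scoring. A Borda count itself is standard and can be sketched as below; the function name `borda_winner`, the candidate labels, and the alphabetical tie-break are illustrative assumptions, not details from that project.

```python
from collections import Counter
from typing import Dict, List, Tuple


def borda_winner(rankings: List[List[str]]) -> Tuple[str, Dict[str, int]]:
    """Aggregate judge rankings (each listed best to worst) with a Borda count.

    With n candidates, a judge's first choice earns n-1 points, the second
    n-2, and so on down to 0 for the last.
    """
    scores: Counter = Counter()
    for ranking in rankings:
        n = len(ranking)
        for position, label in enumerate(ranking):
            scores[label] += n - 1 - position

    # Assumed tie-break: alphabetical order of the label.
    winner = max(sorted(scores), key=scores.__getitem__)
    return winner, dict(scores)
```

In a blind setup the labels would be shuffled or anonymized before judges see them, so no judge knows which candidate is the unchanged incumbent.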

Finance / Trading

Source file: categories/finance-trading.md

  • atlas-gic - Trading: applies Karpathy-style autoresearch to a swarm of market agents, rewriting the worst-performing prompts and keeping changes only when rolling Sharpe improves.
  • autoresearch-trading - Options trading: applies an autoresearch-style keep/revert loop to SPY strategy parameters, logging each experiment against backtest metrics.
  • autoresearch-trading - Trading research: combines Karpathy-style autoresearch with classical optimization so the agent iterates on strategy structure while an optimizer tunes parameters and walk-forward validation decides what survives.
  • BTCautoresearch - Bitcoin forecasting: uses Karpathy-style autoresearch to mutate a single formula file, score walk-forward out-of-sample RMSE, and keep only forecasting rules that beat the baseline power law.
  • autoresearch-skfolio - Portfolio optimization: edits a single portfolio-research script, runs fixed out-of-sample validation across multiple datasets and reversed-return variants, and keeps only Deflated Sharpe Ratio gains.
  • AutoHypothesis - Portfolio research: runs a Karpathy-style autoresearch loop on agent.py, iterating on DEV data and keeping only stock-selection hypotheses that clear one-shot holdback and walk-forward gates before final holdout evaluation.
  • autoresearch-glm - Credit scoring: adapts autoresearch to Taiwan credit-default prediction by editing feature-policy code and keeping only validation AUC gains in a fixed logistic-GLM benchmark.
  • Feature Research at Machine Speed - Credit underwriting: adapts Karpathy’s autoresearch to compute_features.py, scoring each iteration by Information Value and ending the round after 25 runs once no new strong features emerged, yielding 58 strong features and a +0.023 AUC lift at a matched 20-feature budget.
  • autoresearch-markets - Prediction-market trading research: adapts Karpathy’s single-file keep/revert loop to Kalshi data, editing train.py and optimizing val_logloss on held-out resolved markets.
  • Simmer Autoresearch - Prediction-market trading: lets agents mutate skill configs, measure P&L or edge over live trading cycles or historical replays, and auto-commit only the configurations that improve results.
  • Autonomous Trading Strategy Research - Crypto trading research: adapts Karpathy’s single-file autoresearch loop to Hyperliquid perpetual futures, backtesting each strategy.py change on fixed historical data and keeping only score improvements across 103 autonomous experiments.
  • PolyEdge AutoResearch - Prediction-market arbitrage: applies a Karpathy-style keep/discard loop to Polymarket Up/Down paper trading, mutating one strategy parameter at a time and scoring each multi-window run on P&L, fill rate, and trading frequency.
  • AutoResearch — Autonomous DEX Strategy Discovery - DEX trading research: applies Karpathy-style autoresearch to Base DEX strategies, backtesting one mutation at a time against real Uniswap V3 and Aerodrome data and lifting composite score from 0.421 to 8.176 over 230+ experiments.
  • Winning the Paradigm Prediction Market Challenge with Claude Code - Prediction-market market making: uses parallel Claude Code agents as an autoresearch swarm to generate 1,039 strategy variants, run 2,000+ evaluations, and optimize mean edge to a first-place finish in Paradigm’s challenge.
  • Autoresearch Trading Strategy Optimizer - Crypto trading research: applies Karpathy’s autoresearch to one editable strategy.py, hill-climbing on deterministic historical backtests and keeping only commits that improve final_portfolio_value / max_drawdown.
  • Investing Autoresearch - Trading strategy research: uses an autonomous Claude loop to rewrite strategy.py, backtest on held-out market data, and keep only strategies that improve out-of-sample Sharpe under walk-forward, slippage, and fee validation.
  • EMA Crossover Autoresearch - Equity trading research: adapts Karpathy’s three-file autoresearch loop to an SBIN EMA strategy, mutating only strategy.py, backtesting a fixed 10-year Indian equities dataset, and keeping only changes that improve a composite return, Sharpe, and drawdown score.
  • NSE AutoResearch - Equity trading research: adapts Karpathy’s single-file keep/revert loop to Indian stock backtesting by mutating only strategy.py and keeping only lower composite-score strategies over a fixed 10-year NSE harness.
  • delu-agent - Live crypto trading: runs a self-improving treasury agent on Base with 5 parallel autoresearch loops that evolve the scoring model through 9,000+ backtested experiments 24/7 and execute trades via Bankr without human intervention.
  • investment-autoresearch - Trading strategy research: a Claude Code plugin that runs parallel agents testing single-hypothesis strategy variants in isolated git worktrees, scoring each against baseline buy-and-hold, and resetting from verified insights when incremental search plateaus.
  • autoresearch-crypto - Crypto trading research: applies Karpathy’s single-file, backtest-driven keep/revert loop to crypto strategy development with fixed historical data and metric-based acceptance gates.
  • feat(autoresearch): autonomous experiment engine in Juspay Neurolink - Financial infrastructure research: merges an 8,700-line autonomous experiment engine with Karpathy-style sub-agents and git-backed keep/revert decisions into the Juspay Neurolink product codebase.
  • ml-vs-leadlag-jp-stock - Japanese equity trading: applies Karpathy’s autoresearch framework to Japanese stock prediction, comparing ML-driven Sharpe maximization against a classical lead-lag PCA strategy on reproducible benchmarks with a published write-up.
  • autoresearch-backtesting - Equity trading: uses an autonomous LLM agent to iteratively rewrite its PyTorch network for SPY directional prediction, backtesting each mutation and keeping only Sharpe improvements in a train → backtest → keep/discard loop.
  • IPL Odds Multi-Agent Autoresearch - Sports betting research: deploys 7 Claude Code subagents to forecast IPL match outcomes, grades each forecast by Brier score, and autonomously mutates reasoning rules across a full season without human intervention.
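Several of the trading entries above accept a change only when a backtest score improves, for example `final_portfolio_value / max_drawdown`. The following is a minimal sketch of such a gate, not code from any listed repository. It assumes the backtest yields an equity curve of strictly positive portfolio values; the function names, the drawdown definition (fraction of the running peak), and the handling of a zero-drawdown curve are illustrative assumptions.

```python
from typing import Sequence


def max_drawdown(equity: Sequence[float]) -> float:
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def score(equity: Sequence[float]) -> float:
    """Hypothetical score: final portfolio value divided by max drawdown."""
    dd = max_drawdown(equity)
    # Assumption: a curve that never draws down gets an infinite score.
    return float("inf") if dd == 0 else equity[-1] / dd


def keep_change(baseline: Sequence[float], candidate: Sequence[float]) -> bool:
    """Keep the candidate only if it strictly improves the score."""
    return score(candidate) > score(baseline)
```

Because the backtest data is fixed, the comparison is deterministic: the same strategy edit always yields the same keep or revert decision.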

Personal Knowledge / Humanities

Source file: categories/personal-knowledge-humanities.md

  • autoresearch-genealogy - Genealogy: uses Claude Code /autoresearch prompts to expand family trees, verify claims against multiple sources, and keep a structured evidence-backed research vault.
  • claude-obsidian - Personal knowledge: uses a Karpathy-style /autoresearch skill to run multi-round web research with gap-filling, then files source-backed concept, entity, and synthesis pages into a compounding Obsidian wiki vault.

Infra / Skills / Forks

Source file: categories/infra-skills-forks.md

  • n-autoresearch - Autoresearch infra: extends Karpathy’s loop with structured experiment state, multi-GPU parallelism, adaptive search, and crash recovery.
  • autoresearch-evaluation-harness - Evaluation infrastructure: compares autoresearch-style proposal strategies under fixed task adapters, explicit scalar scoring, and hard keep/discard gates.
  • autoresearch-mlx - Apple Silicon fork: ports Karpathy’s autoresearch to MLX while keeping the fixed-time training budget, single mutable file, and git keep/revert loop.
  • Claude Autoresearch - Claude Code skill: generalizes Karpathy’s autoresearch into a reusable modify → verify → keep/discard loop for measurable goals beyond ML.
  • claude-autoresearch - Claude Code plugin: runs autoresearch on isolated branches with deterministic verification commands, scheduled overnight sessions, and structured keep/discard reports.
  • lazy-developer - Claude Code plugin suite: runs repeated autoresearch phases across coverage, build speed, test speed, complexity, and performance goals while enforcing per-phase file locks and revert-on-regression behavior.
  • codex-autoresearch - Codex skill: brings the autoresearch pattern to Codex for unattended metric-driven software iteration with automatic keep/discard decisions.
  • gemini-autoresearch - Gemini CLI and Antigravity skill: runs goal-driven overnight improvement loops with verify and guard gates, keeping metric wins and automatically reverting regressions.
  • autoresearch-plugin - Claude Code plugin: packages the Karpathy-style experiment loop into init/test/run commands for projects with explicit evaluation scripts and git rollback.
  • Artificial General Research - Optimization framework: turns measurable code optimization tasks into autoresearch loops with variance-aware acceptance, artifact detection, and exhausted-approach tracking.
  • autoresearch-engram - Memory extension: adds persistent recall, pattern extraction, and reflection steps to Karpathy’s autoresearch so the agent remembers what worked across long runs.
  • pi-autoresearch - pi extension: generalizes Karpathy’s autoresearch into experiment tools, a live dashboard, and slash-command skills for metric-driven optimization beyond ML.
  • openclaw-autoresearch - OpenClaw plugin: ports pi-autoresearch to OpenClaw with pending-run enforcement, confidence scoring, checkpoint files, and git-backed keep/discard semantics.
  • AutoResearch OpenClaw - OpenClaw package: runs Karpathy-style keep/revert loops on one text asset by measuring a scalar metric, committing winners, reverting losers, and logging each session to results.tsv.
  • autoresearch-opencode - OpenCode skill: ports pi-autoresearch into OpenCode as a pure skill that logs JSONL experiment runs and resumes autonomous keep/discard loops with built-in tools.
  • pi-autoresearch-studio - pi control plane: adds TUI and web dashboards, plan and ideas editing, and selective PR creation on top of pi-autoresearch sessions.
  • autoresearch-gen - Scaffold generator: interviews the user, generates a verified autoresearch experiment scaffold, auto-runs the baseline, and repairs broken generated code before handoff to the agent.
  • autoresearch-autoresearch - Meta-autoresearch repo: maintains a portable canonical loop distilled from karpathy/autoresearch and adjacent systems so new evidence can update a reusable agent-verifier architecture across domains.
  • Bilevel Autoresearch - Meta-autoresearch framework: adds outer loops that rewrite autoresearch search mechanisms themselves and reports multi-run gains on Karpathy’s training benchmark.
  • SkyPilot parallel autoresearch - GPU infrastructure: gives Karpathy’s autoresearch access to 16 GPUs so the agent can run parallel experiment waves, validate winners on faster hardware, and reach about 910 runs in about 8 hours.
  • autoresearch_deeplake_swarm - Cloud swarm infrastructure: extends Karpathy’s loop with Modal-powered parallel workers and a shared Deeplake experiment notebook so multiple agents can explore train.py concurrently and surface only the best surviving commits.
  • Autoresearch on Red Hat OpenShift AI - Kubernetes ML infrastructure: runs Karpathy’s autoresearch as a 24-hour OpenShift AI workload, packaging nanochat into containers that logged 198 experiments and improved validation loss by 2.3% without human intervention.
  • serverless-autoresearch - SageMaker infrastructure: parallelizes Karpathy’s autoresearch on Spot training jobs so the agent evaluates train.py candidates with HUGI-style burst compute instead of paying for idle GPUs.
  • autoresearch-win-rtx - Windows GPU fork: ports Karpathy’s single-file, 5-minute, val_bpb keep/discard loop to native Windows on consumer RTX GPUs.
  • autoresearch-amd - AMD GPU fork: ports Karpathy’s single-file, 5-minute val_bpb keep/discard loop to ROCm by replacing Flash Attention 3 with portable SDPA for RDNA 4 cards.
  • autoloop - Agent runtime: generalizes Karpathy’s autoresearch into bounded repo-level loops with inferred eval commands, explicit guardrails, and keep/discard decisions across multiple coding agents.
  • GOAL.md - Goal-spec framework: generalizes Karpathy’s autoresearch to repos without native scalar metrics by constructing a project-specific fitness function in GOAL.md, then running measure → act → verify → keep/revert loops against it.
  • autoresearch-claude-code - Claude Code plugin: ports pi-autoresearch into a pure plugin skill with JSONL state, slash-command control, and autonomous keep/discard loops for arbitrary METRIC-based benchmarks.
  • autoresearch-benchmark - Benchmarking infrastructure: compares four autoresearch-style tools on the same sorting-throughput task and records both performance gains and iteration behavior under a shared setup.
  • CORAL - Multi-agent autoresearch infrastructure: runs Claude Code, Codex, or OpenCode workers in isolated worktrees, grades each attempt with coral eval, and keeps scored improvements while sharing notes and skills across agents.
  • autoresearch for agents - Agent evaluation template: adapts autoresearch to agent.py plus fixed run_eval.py and dataset.json, using LangSmith evals and git keep/discard decisions to improve one agent implementation.
  • autoresearch-automl - Benchmarking research: compares nine classical, LLM-based, and hybrid optimizers on Karpathy’s nanochat task under a shared 24-hour budget, showing code-editing autoresearch is competitive but fixed-space classical HPO still wins.
  • autoresearch-anycloud - Cloud GPU infrastructure: wraps Karpathy’s autoresearch in a unified Mac and cloud runner with platform setup, budget watchdogs, result collection, and automatic teardown across AWS, GCP, Azure, and OCI.
  • skill-autoresearch for Hermes Agent - Hermes skill: optimizes prompts, scripts, and validators through baseline → diagnose → patch → re-evaluate → keep/revert loops with dependency checks and conservative holdout rules.
  • autoresearch-anything - Claude Code skill: scaffolds Karpathy-style autoresearch pipelines for measurable business metrics by generating connectors, persistence setup, and deploy → measure → keep/discard loops around API-observable outcomes.
  • AutoSkill - Skill prompt optimization framework: applies Karpathy’s keep/revert loop to SKILL.md, mutating one prompt at a time against test cases and improving an auto-reminder skill from 45% to 90% reliability over 60+ autonomous iterations.
  • EvoSkill - Skill-evolution framework: analyzes failed coding-agent trajectories, proposes skill or prompt changes, evaluates them against benchmarks, and keeps only better agent variants in a Karpathy-style self-improvement loop.
  • Skill Forge v2 - Skill and code optimization framework: adapts Karpathy’s autoresearch to SKILL.md files and generic codebases, using dry-run validation, objective deltas, and keep/revert thresholds to steer autonomous or guided experiment loops.
  • autoimprove-cc - Claude Code skill optimizer: applies a Karpathy-style autoresearch loop to SKILL.md, scoring binary assertions from eval.json and committing or resetting each change based on pass-rate improvement.
  • ehmo/autoresearch-skill - Claude Code and Codex skill: generalizes autoresearch into clean-room red, green, and refactor teams that iteratively find issues, fix them under test, and simplify code on a feature branch while the coordinator keeps only verified progress.
  • ResearcherSkill - Claude Code and Codex skill: generalizes autoresearch into git-backed .lab/ sessions with branching experiment trees, convergence detection, and commit/revert control, improving Yggdrasil agent rules from 1.82 to 7.04 in a published loop.
  • Litmus - Parallel ML research infrastructure: turns OpenClaw into a multi-agent autoresearch lab with branch-isolated workers, scheduled director and synthesizer roles, and keep/revert experiment commits plus shared discoveries and skills.
  • Autoresearch CLI - Cross-agent experiment infrastructure: packages Karpathy’s one-file, one-metric keep/revert loop as a Rust CLI that scaffolds configs, validates eval commands, records JSONL results, and installs slash-command skills into multiple coding agents.
  • codex-autoresearcher - Codex experiment infrastructure: runs optimization campaigns through separate worker and judge Codex processes, a static evaluate.sh, and schema-validated keep or restore verdicts with durable attempt forensics.
  • Codex Autoresearch Harness - Codex experiment harness: wraps codex exec in a bash loop so Codex can run one Karpathy-style experiment at a time, keep or reset train.py changes, and compare model variants on the nanochat benchmark.
  • ExAutoresearch - Elixir autoresearch framework: hot-loads one experiment module at a time, trains GPT variants under fixed budgets across distributed GPU nodes, and uses a referee plus dashboard to early-stop losers and persist the best surviving trials.
  • slowresearch - Delayed-feedback experiment skill: adapts autoresearch to content, outreach, pricing, and other publish-and-wait workflows by logging human-reported metrics and proposing the next hypothesis across long feedback cycles.
  • AutoAgent - Agent-engineering infrastructure: applies Karpathy’s autoresearch to a single-file Harbor agent harness, rewriting agent.py, benchmarking scored tasks, and keeping only prompt, tool, or orchestration changes that raise total score.
  • VibeHQ - Multi-agent coordination infrastructure: applies autoresearch to team protocol design by benchmarking agent swarms, analyzing failure logs, rewriting hub code via /optimize-protocol, and iterating until coordination flags and token waste fall.
  • helix - Agent-agnostic autoresearch infrastructure: generalizes Karpathy’s loop into reproducible helix.yaml + program.md repos with backend-swappable agents, append-only experiments.tsv ledgers, and independently verifiable example helices.
  • Autolab Companion Tools - Autoresearch companion infrastructure: adds statistical keep/discard verdicts, experiment-history steering, and multi-agent branch competitions to Karpathy’s GPT-pretraining loop through the autojudge, autosteer, and autoevolve CLIs.
  • autoresearch-cpu - CPU ML fork: ports Karpathy’s autoresearch to commodity CPUs by replacing Flash Attention with native SDPA, shrinking defaults for 30-minute local runs, and preserving the same one-file val_bpb keep/discard loop without CUDA.
  • autoresearch-cn - Chinese-language fork: translates Karpathy’s autoresearch into Chinese while preserving the original MIT license, single-file training loop, fixed-time budget, and val_bpb keep/discard git workflow.
  • autoresearcher2 - Bayesian research infra: extends Karpathy’s autoresearch with structured Bayesian experiment selection, active-inference-inspired exploration, learntropy-style appraisal, and persistent memory, running v4.8 in production on dual RTX PRO 6000 Blackwell GPUs across NanoGPT training and Atari Breakout RL domains.
  • multiautoresearch - Multi-agent research lab: organizes Karpathy-style autoresearch into three sub-projects (pre-training, post-training, inference optimization) with planner, researcher, reviewer, and memory-keeper sub-agents across Claude Code, Codex, OpenCode, and pi harnesses.
  • autoresearch-serverless - GCP serverless infra: deploys Karpathy’s autoresearch as a fully autonomous Cloud Run research lab with NVIDIA L4/RTX PRO 6000 GPUs, Gemini Flash reasoning, GCS checkpoints, and Cloud Workflow chaining for multi-day 24/7 studies.
  • hugoferreira/autoresearch - Codebase research framework: generalizes Karpathy’s loop into falsifiable hypotheses, isolated experiment worktrees, instrument-backed observations, strict gate review, and reusable lessons for measurable engineering goals.
  • scalar-loop - Experiment infrastructure: generalizes Karpathy’s autoresearch into a Python-core CLI that seals harness files, enforces repo scope, and keeps or reverts only edits that pass metric and guard commands.
  • Autoresearch Lab - Black-box experiment infrastructure: wraps any measurable code pipeline in Docker sandboxing, Python eval backends, host-side git commit/revert control, and keep/discard research loops.
  • evo - Parallel optimization framework: extends Karpathy-style autoresearch with discovered benchmarks, git-worktree tree search, parallel subagents, and commit-only-if-gate-passing score improvements.
  • autoresearch-builder - Claude Code plugin: auto-detects project type, mutates one target file, runs the chosen build, test, or train command, and keeps or discards each experiment based on a parsed scalar metric.
  • AdaL vs Claude Code: Autoresearch Benchmark - Benchmarking research: publishes head-to-head Karpathy-autoresearch runs on A10 and H100 hardware, comparing best val_bpb, experiment counts, and kept improvements between AdaL and Claude Code.
  • 达尔文.skill - Skill optimization framework: adapts Karpathy’s autoresearch to one SKILL.md at a time, rescoring structure plus real test outputs and keeping only higher-scoring revisions.
  • Community Computer - Collaborative experiment network: turns autoresearch runs into signed, peer-reproducible experiment records and auto-publishes keep/discard outcomes from Claude Code or pi over a Radicle peer-to-peer network.
  • autoresearch-macos - Apple Silicon fork: ports Karpathy’s autoresearch to macOS with MPS/SDPA fallbacks, preserving the fixed-time training budget, single mutable file, and git keep/revert loop on Apple Silicon Macs.
  • autoresearch-tenstorrent - Tenstorrent hardware port: ports Karpathy’s autoresearch to Tenstorrent Wormhole devices with a TT-XLA backend, preserving the 5-minute fixed budget and keep/revert loop while adapting to a non-CUDA AI accelerator.
  • autoresearch-prompt-optimization - Prompt optimization skill: adapts Karpathy’s autoresearch to autonomous prompt tuning, lifting accuracy from 74.72% to 100% across 8 keep/discard experiments with a locked eval script.
  • autoresearch-council-arena - Writing optimization framework: combines Karpathy’s autoresearch with llm-council so 4 LLMs compete, rank each other, and hill-climb toward the best argument through keep/discard iterations.
  • Auto-Research-Recipes - Training recipe research: turns agent-written code, real experiments, and evaluator feedback into a closed-loop autoresearch process that develops non-trivial training recipes.
  • autoresearch-everywhere - Cross-platform expansion: optimizes Karpathy’s autoresearch for broader hardware and environment support while preserving the fixed-budget keep/revert core loop.
  • awesome-autoresearch - Curated knowledge base: a parallel community-curated list of AutoResearch use cases with optimization traces and open-source implementations, with 915+ stars.
  • Awesome-Auto-Research-Tools - Curated tool index: catalogs Auto-Research tools, skill frameworks, platform plugins, and benchmark harnesses across the broader ecosystem.
  • Autoharness - Agent harness optimization: lets agents explore changes to prompts, configs, and runtime context, run evals, and keep only improvements, achieving benchmark lifts up to 40.7%.
  • Thoth - Orchestration runtime: provides a dashboard-first control plane for autoresearch that turns drifting agent work into durable runs, locked work items, and reviewable verdicts across Claude Code and Codex.
  • GEPAResearch - Plugin for agentic frameworks: optimizes code using the Genetic-Pareto LLM-driven search algorithm with isolated git worktrees, gated evaluation, and Pareto frontier candidate tracking for keep/discard decisions.
  • Codex Autoresearch - Codex plugin: wraps optimization goals into measured experiment packets with benchmark contracts, durable session files, evidence trails, and live dashboards for keep/discard review.
  • Cursor Autoresearch - Cursor and VS Code extension: ports pi-autoresearch’s MCP-driven optimization loop into a code editor with JSONL tracking, browser dashboard, and agent skills for metric-driven keep/discard workflows.
  • interpretable-autoresearch - Governance infrastructure: makes autoresearch agent behavior transparent and auditable through behavioral specifications, winning 2nd place at the MIT Spring 2026 Hackathon.
  • skills-autoresearch-flue - Evaluation infrastructure: provides a Flue agent harness that evaluates a seed skill against project fixtures, asks a researcher model to improve the skill, then reruns evals against the candidate.
  • autoresearch-plugin - Research pipeline plugin: packages a full autonomous research pipeline as a Claude Code plugin with 65 skills across 9 families, guiding a direction from vague idea to submission-ready paper.
  • research-program-skill - Research workflow skill: combines ml-intern’s discovery discipline with autoresearch’s baseline-first, metric-first loop into a portable skill for Claude Code, Codex CLI, and Cursor Agent.
  • autoimprove - Codebase improvement infra: generalizes Karpathy’s autoresearch for Claude Code and Codex with git worktree isolation, configurable improvement strategies, and keep/discard experiment logs.
  • harvester - Code maintenance infra: applies the Karpathy autoresearch pattern to watch repos, surface improvement opportunities as GitHub issues, execute approved changes overnight via Claude Code, and produce draft PRs for morning review.
  • Autoresearch - Parallel experimentation framework: spawns supportive and adversarial Claude Code workers to collect evidence for or against research directions, with a judge scoring write-ups and curating the roadmap through iterative convergence detection.
  • pi-autoresearch-harness - pi extension: forks pi-autoresearch with a worktree-isolated experiment harness, adding init/run/log experiment tools and a slash-command dashboard for metric-driven keep/discard loops.
  • Spark Researcher - Lightweight autoresearch framework: blends Karpathy’s compact loop with domain chips, a collective intelligence network, and bounded self-editing for disciplined local experiment management.
  • EvE: Evolutionary Ensemble of Agents - Multi-agent evolution framework: wraps coding agents into a decentralized evolutionary ensemble that co-evolves solver code and agent guidance through autoresearch-style competition, published as arXiv:2605.09018.
  • deep-evolve - Claude Code plugin: generalizes autoresearch from ML training to any software project with automatic evaluation harness generation, journal-based crash recovery, and self-evolutionary outer-loop strategy improvement.
  • AutoResearch-AI - Autoresearch infra: provides an npm CLI that installs a durable research harness into ML repos so agents can inspect code, propose experiments, run baselines, and continue loops without losing context across sessions.
  • autoresearch - Autoresearch infra: provides a self-driving experiment sweep loop with daemon-detached autoresearch.py, live PR-updating progress charts, and GPU monitoring for ML training projects.
  • pi-lifeline - Autoresearch infra: pi extension that lets smaller autoresearch models request strategic guidance from a stronger advisor model when stuck on plateaus, implementing rate-limited escalation with configurable consecutive-failure and plateau-run thresholds.
  • autoresearch-skill-Andrej-Karpathy - Claude Code skill: turns natural-language optimization goals into autonomous goal-parsing, baseline-capture, and keep/discard loops with real-data validation, regression protection, and a reported 76% P99 latency reduction on a live signal-trading pipeline.
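The pi-lifeline entry above describes rate-limited escalation driven by consecutive-failure and plateau-run thresholds. The sketch below shows one way such a gate could be structured. It is not pi-lifeline's implementation: the class name, field names, default values, and the choice to reset counters after an escalation are all hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EscalationGate:
    """Decides when a small model should ask a stronger advisor for help."""
    max_consecutive_failures: int = 3          # hypothetical default
    max_plateau_runs: int = 5                  # hypothetical default
    min_seconds_between_calls: float = 600.0   # hypothetical rate limit
    clock: Callable[[], float] = time.monotonic
    _failures: int = 0
    _plateau: int = 0
    _last_call: float = field(default=float("-inf"))

    def record(self, crashed: bool, improved: bool) -> None:
        """Update counters after one experiment run."""
        self._failures = self._failures + 1 if crashed else 0
        self._plateau = 0 if improved else self._plateau + 1

    def should_escalate(self) -> bool:
        """True when a threshold is hit and the rate limit allows a call."""
        stuck = (self._failures >= self.max_consecutive_failures
                 or self._plateau >= self.max_plateau_runs)
        if not stuck:
            return False
        now = self.clock()
        if now - self._last_call < self.min_seconds_between_calls:
            return False  # still stuck, but the advisor was called too recently
        self._last_call = now
        self._failures = self._plateau = 0  # assumption: start fresh after advice
        return True
```

Injecting the clock keeps the gate testable without sleeping. In a loop, `record` would be called after every run, and the advisor model would be consulted only when `should_escalate` returns `True`.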

Related Practices / Discussions

Source file: categories/related-practices-discussions.md

Trading / markets

Business / GTM workflows

Workflow automation / consumer ops

Prompt / evaluation

Software / code workflows

Scientific / research augmentation

Infra / benchmarking ideas

Software / code workflows (continued)

  • Autoresearch Is Reward Function Design - Blog: applies pi-autoresearch to optimize a Python finance scoring pipeline across 49 experiments with keep/discard, cutting p95 latency from 339ms to 34ms and arguing that autoresearch success hinges on reward-function design quality rather than the loop mechanics.

Infra / benchmarking ideas (continued)

Knowledge Base / RAG Preparation

Source file: categories/knowledge-base-rag-preparation.md

  • autoresearch-genealogy - Genealogy: uses Claude Code /autoresearch prompts to expand family trees, verify claims against multiple sources, and keep a structured evidence-backed research vault.
  • AutoRAGsearch - RAG retrieval optimization: applies an autoresearch-style loop to a fixed QA benchmark by editing only rag_pipeline.py, running local retrieval experiments, and improving retrieval_score from 0.9472 to 0.9867 over 20 autonomous experiments.
  • claude-obsidian - Knowledge-base preparation: uses a Karpathy-style /autoresearch skill to gather sources, fill research gaps across three search rounds, and write structured source, concept, entity, and synthesis pages into a retrieval-ready Obsidian vault.

Market Research

Source file: categories/market-research.md

  • atlas-gic - Trading: applies Karpathy-style autoresearch to a swarm of market agents, rewriting the worst-performing prompts and keeping changes only when rolling Sharpe improves.

Workflow Automation

Source file: categories/workflow-automation.md

  • AutoResearchClaw - Scientific research: turns a research idea into a paper through a fully autonomous multi-stage pipeline with self-healing experiments and pivot/refine loops.
  • Claudini - AI safety research: uses an autoresearch-style loop to invent and benchmark new LLM attack algorithms, keeping only methods that outperform baselines.
  • AutoKernel - GPU optimization: applies Karpathy-style autoresearch to kernel bottlenecks, iterating on code, benchmarking, and keeping only changes that improve speed without breaking correctness.
  • autovoiceevals - Voice AI evaluation: attacks voice agents with adversarial callers, proposes prompt changes one at a time, and keeps or reverts them based on eval results.
  • PM document optimizer - Product workflow automation: applies a Karpathy-style git ratchet to markdown artifacts like PRDs and strategy docs, scoring each draft with programmatic checks and committing only higher-scoring revisions.
  • Trip Optimizer Pro - Travel planning workflow automation: applies the autoresearch pattern to itinerary generation by researching destinations, scoring multi-day plans, and keeping only itinerary mutations that improve a weighted travel-quality score.
  • Autoresearch for Software Development - Software delivery automation: adapts autoresearch to GitHub Issues by rotating agents through implement → review → fix loops, then auto-merging only issues that clear a score gate plus build, lint, and test checks.
  • autobrowse - Browser workflow automation: applies an autoresearch loop to browser tasks by iterating on strategy.md, replaying tasks with Browserbase, and graduating only reliable workflows into reusable Claude Code skills.

Submission format

Use exactly one line per entry:

- [Name](URL) - Industry: one-sentence description of the autoresearch use case.
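The one-line format above can be checked mechanically before opening a pull request. The following is a small sketch of such a check, assuming entries use an `http` or `https` URL and a single `: ` between the industry label and the description. The regular expression, the `parse_entry` helper, and the `example.com` URL are illustrative and not part of the repository's tooling.

```python
import re
from typing import Optional

# Hypothetical pattern for: - [Name](URL) - Industry: description
ENTRY_PATTERN = re.compile(
    r"- \[(?P<name>[^\]]+)\]"          # - [Name]
    r"\((?P<url>https?://\S+)\)"       # (URL), assumed to be http(s)
    r" - (?P<industry>[^:]+): "        # - Industry:
    r"(?P<description>.+)"             # one-sentence description
)


def parse_entry(line: str) -> Optional[dict]:
    """Return the entry's fields, or None if the line breaks the format."""
    match = ENTRY_PATTERN.fullmatch(line)
    return match.groupdict() if match else None


entry = parse_entry(
    "- [example-project](https://example.com/example-project) - "
    "Apple Silicon fork: ports the loop to a new backend."
)
```

Using `fullmatch` rejects multi-line entries, which enforces the "exactly one line per entry" rule.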

How to contribute

  1. Pick the category that best matches the direct autoresearch application domain.
  2. Add a single-line entry in the required format to the category file, not directly to the README aggregate.
  3. Keep the summary concrete and scannable.
  4. Prefer examples that clearly show scenario + autoresearch loop + value.

See CONTRIBUTING.md for details.

License

MIT
