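@geekbb: An automated optimization tool for agent harnesses that takes over the grunt work of harness tuning. You give it a benchmark command and a target repo, and it automatically generates proposals, runs evals, records results, keeps the good candidates, discards the bad ones, and improves the agent's prompts, config, and source code. https…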

X AI KOLs Timeline 2026/05/11 08:36 Tools

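Summary

autoharness is an automated optimization tool for agent harnesses. Given a benchmark command, it automatically generates proposals, runs evaluations, and improves the agent's prompts, config, and source code. It supports Codex and Claude.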

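An automated optimization tool for agent harnesses that takes over the grunt work of harness tuning. You give it a benchmark command and a target repo, and it automatically generates proposals, runs evals, records results, keeps the good candidates, discards the bad ones, and improves the agent's prompts, config, and source code. https://t.co/2qhYImGjuP https://t.co/t9qGZMZjkP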


Cached: 2026/05/11 12:42

kayba-ai/autoharness

Source: https://github.com/kayba-ai/autoharness

autoharness

Let autoharness run overnight and come back to an optimized agent harness, so your production agents never make mistakes again.

autoharness improves agent harnesses by proposing or applying prompt, config, middleware, and source changes, running evals, and keeping or discarding candidates based on benchmark results.

It is a control plane for an existing harness repo. You point it at a target root and a benchmark command; autoharness manages proposals, iterations, campaigns, and champion state under .autoharness/.
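The keep-or-discard idea described above can be illustrated with a minimal sketch. This is not autoharness's implementation: `optimize`, `toy_propose`, and `toy_score` are hypothetical names, and the "harness" here is just a dictionary with one made-up `temperature` setting standing in for prompt, config, or source changes.

```python
import random
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]

def optimize(
    baseline: Config,
    propose: Callable[[Config, random.Random], Config],
    score: Callable[[Config], float],
    iterations: int,
    seed: int = 0,
) -> Tuple[Config, float, List[Tuple[int, float, bool]]]:
    """Greedy search: a candidate replaces the champion only if it scores strictly higher."""
    rng = random.Random(seed)
    champion, champion_score = baseline, score(baseline)
    history: List[Tuple[int, float, bool]] = []
    for step in range(iterations):
        candidate = propose(champion, rng)       # generate a proposal
        candidate_score = score(candidate)       # run the eval
        kept = candidate_score > champion_score  # keep or discard
        history.append((step, candidate_score, kept))  # record the result
        if kept:
            champion, champion_score = candidate, candidate_score
    return champion, champion_score, history

# Toy stand-ins (assumptions for illustration only).
def toy_propose(config: Config, rng: random.Random) -> Config:
    new = dict(config)
    nudged = config["temperature"] + rng.choice([-0.1, 0.1])
    new["temperature"] = round(min(1.0, max(0.0, nudged)), 2)
    return new

def toy_score(config: Config) -> float:
    # Pretend the benchmark peaks at temperature 0.3.
    return 1.0 - abs(config["temperature"] - 0.3)

champion, best, history = optimize({"temperature": 0.9}, toy_propose, toy_score, iterations=50)
print(champion, best)
```

Because a candidate is only kept when it beats the current champion, the champion score never decreases across iterations; a real search loop may use a different acceptance rule.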

Install

Fastest setup with Codex or Claude:

pipx install "git+https://github.com/kayba-ai/autoharness.git"
cd into your harness repo
open Codex or Claude Code in that repo
tell the assistant: Run autoharness guide --assistant codex --print-next-prompt, then use the generated onboarding packet to finish setup.

For Claude Code, swap --assistant codex for --assistant claude.

Otherwise:

pipx install "git+https://github.com/kayba-ai/autoharness.git"
autoharness --help

If you do not use pipx:

python3 -m pip install --user "git+https://github.com/kayba-ai/autoharness.git"

How It Works

guide inspects a repo, asks a few focused setup questions in a TTY, stays scriptable with flags in non-interactive use, writes a starter autoharness.yaml plus benchmark config, and runs a readiness check.
doctor reruns config, generator, and benchmark validation when you want an explicit readiness gate.
setup and init remain available when you want to manage bootstrap explicitly.
run-benchmark executes one benchmark directly.
generate-proposal previews one candidate change without running it.
run-iteration executes one candidate; optimize runs a resumable search loop.
promote or promote-from-compare moves a winner into champion state.
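To make the "benchmark command" step in the list above concrete, here is a rough sketch of how a wrapper might execute one command and turn its output into a result. It is an illustration only: the `run_benchmark` function, the rule that the last non-empty stdout line is a numeric score, and the result keys are all assumptions, not the behavior of the `run-benchmark` command or the `generic_command` adapter.

```python
import subprocess
import sys
from typing import List, Optional

def run_benchmark(command: List[str], timeout_s: float = 60.0) -> dict:
    """Run one benchmark command and summarize its outcome."""
    completed = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout_s
    )
    lines = [line for line in completed.stdout.splitlines() if line.strip()]
    score: Optional[float] = None
    if lines:
        try:
            # Assumption: the benchmark prints its score on the last line.
            score = float(lines[-1])
        except ValueError:
            score = None
    return {
        "passed": completed.returncode == 0,
        "score": score,
        "returncode": completed.returncode,
    }

# Stand-in benchmark: a tiny Python one-liner that prints a log line and a score.
result = run_benchmark([sys.executable, "-c", "print('running'); print(0.75)"])
print(result)
```

Passing the command as an argument list avoids shell quoting problems; a real harness would likely also capture stderr and logs for later inspection.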

Mental Model

target root: the harness repo or deployment tree to edit
benchmark config: the command or adapter config that scores candidates
workspace: the long-lived optimization effort
track: one comparable lane inside a workspace
campaign: a resumable search run over candidate proposals
.autoharness/: persisted settings, proposals, records, iterations, and champions
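One way to picture how these terms nest is as a small data model. The sketch below is a reading of the glossary above, not autoharness's internal schema: the class names, fields, the `record` helper, and the choice to store one champion score per track are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List, Optional

@dataclass
class Campaign:
    """A resumable search run over candidate proposals."""
    name: str
    candidate_scores: List[float] = field(default_factory=list)

@dataclass
class Track:
    """One comparable lane: candidates here share the same benchmark."""
    name: str
    benchmark_command: str
    campaigns: List[Campaign] = field(default_factory=list)
    champion_score: Optional[float] = None  # assumption: one champion per track

@dataclass
class Workspace:
    """The long-lived optimization effort for one target root."""
    name: str
    target_root: Path
    tracks: Dict[str, Track] = field(default_factory=dict)

    @property
    def state_dir(self) -> Path:
        # Persisted state lives under .autoharness/ in the target root.
        return self.target_root / ".autoharness"

def record(track: Track, campaign: Campaign, score: float) -> bool:
    """Log a candidate score and promote it if it beats the track's champion."""
    campaign.candidate_scores.append(score)
    if track.champion_score is None or score > track.champion_score:
        track.champion_score = score
        return True
    return False

ws = Workspace(name="airline", target_root=Path("my-harness"))
track = Track(name="tau2", benchmark_command="pytest -q")
ws.tracks[track.name] = track
campaign = Campaign(name="overnight-1")
track.campaigns.append(campaign)
promoted = [record(track, campaign, s) for s in (0.40, 0.35, 0.55)]
print(ws.state_dir, track.champion_score, promoted)
```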

Batteries Included

Adapters: generic_command, pytest, harbor, tau2_bench, hal, car_bench
Proposal generators: manual, failure_summary, local_template, local_command, openai_responses, codex_cli, claude_code
Extension model: Python plugins can add generators, preflight checks, and search strategies from .autoharness/plugins/ or AUTOHARNESS_PLUGIN_PATHS

Quick Start

Let autoharness generate a starter project config:

autoharness guide

In a TTY, guide asks a few setup questions. In scripts or CI, use flags like --non-interactive, --benchmark-command, --generator, and --autonomy.

If you want Codex or Claude to help you refine the setup, generate an assistant brief too:

autoharness guide --assistant codex --print-next-prompt
# or
autoharness guide --assistant claude --print-next-prompt

This writes autoharness.codex.md or autoharness.claude.md plus a structured autoharness.onboarding.json handoff next to autoharness.yaml, then prints a ready-to-paste assistant prompt. Assistant wrapper prompts live under contrib/agents/.

guide ends with a doctor pass. Run autoharness doctor again later if you want an explicit re-check or a repeated benchmark probe.

On a fresh install, guide prefers a local assistant backend when codex or claude is installed, otherwise uses openai_responses when OpenAI credentials are configured, and falls back to failure_summary only when no model-backed generator is available.
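The fallback order in the previous paragraph can be sketched as a small decision function. This is a guess at the shape of the logic, not the code `guide` runs: `pick_default_generator` is a hypothetical name, checking `codex` before `claude` is an arbitrary choice, and treating a set `OPENAI_API_KEY` as "credentials configured" is an assumption. The returned strings are generator names from the list under Batteries Included.

```python
import os
import shutil
from typing import Callable, Mapping, Optional

def pick_default_generator(
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Choose a proposal generator using the documented preference order."""
    # 1. Prefer a local assistant backend if its CLI is on PATH.
    if which("codex"):
        return "codex_cli"
    if which("claude"):
        return "claude_code"
    # 2. Otherwise use the OpenAI-backed generator when credentials exist.
    if env.get("OPENAI_API_KEY"):
        return "openai_responses"
    # 3. Fall back to the generator that needs no model.
    return "failure_summary"

print(pick_default_generator())
```

The `env` and `which` parameters exist only so the rule can be exercised without touching the real machine, for example `pick_default_generator(env={}, which=lambda name: None)`.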

Then run the benchmark directly:

autoharness run-benchmark

If autoharness.yaml is present, autoharness will auto-bootstrap missing settings and workspace state on this common path. setup and init are still available when you want explicit control.

Generate a proposal against a target harness root:

autoharness generate-proposal

If you switch the project config to openai_responses, export an API key first:

export OPENAI_API_KEY=...

Run the outer loop:

autoharness optimize
autoharness report

Early Results

Example from one tau2 airline benchmark study. Relative deltas are measured against the baseline harness on the same workload. Results depend on the benchmark, harness, and evaluation setup, and some intervention combinations can regress.

[Figure: tau2 intervention deltas]
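For readers unfamiliar with the term, a relative delta against a baseline is commonly computed as the change divided by the baseline score. The sketch below assumes that definition; the study's exact formula is not stated here, and the numbers are hypothetical inputs, not results from the tau2 study.

```python
def relative_delta(baseline: float, candidate: float) -> float:
    """Fractional change of a candidate score relative to the baseline score."""
    if baseline == 0:
        raise ValueError("relative delta is undefined for a zero baseline")
    return (candidate - baseline) / baseline

# Hypothetical scores, for illustration only.
improved = relative_delta(baseline=0.50, candidate=0.60)
regressed = relative_delta(baseline=0.50, candidate=0.40)
print(f"{improved:+.0%} {regressed:+.0%}")
```

A negative value marks a regression, which is why both the baseline and the candidate need to be measured on the same workload.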

Docs

For Power Users

Background campaign workers plus queue and worker-state inspection
Root-level memory, transfer suggestions, and portfolio scheduling
Retention policies, pruning, and portable report and bundle exports
Event logs, inspection commands, and operational reporting surfaces
Python plugin hooks for generators, preflight checks, and search strategies

Want deeper analysis or a custom optimization workflow? Kayba offers managed harness optimization and agent-improvement support tailored to your stack.

Star this repo if you find it useful!

Built with ❤️ by Kayba and the open-source community.
