@geekbb: An automated agent-harness optimization tool that takes over the grunt work of harness tuning. Give it a benchmark command and a target repo, and it automatically generates proposals, runs evals, records results, keeps the good candidates, discards the bad ones, and improves the agent's prompts, config, and source code. https…
Summary
autoharness is an automated agent-harness optimization tool that, given a benchmark command, automatically generates proposals, runs evaluations, and improves an agent's prompts, config, and source code. It supports Codex and Claude.
https://t.co/2qhYImGjuP https://t.co/t9qGZMZjkP
kayba-ai/autoharness
Source: https://github.com/kayba-ai/autoharness
autoharness
Let autoharness run overnight and come back to an optimized agent harness, so your production agents never make mistakes again.
autoharness improves agent harnesses by proposing or applying prompt, config, middleware, and source changes, running evals, and keeping or discarding candidates based on benchmark results.
It is a control plane for an existing harness repo. You point it at a target root and a benchmark command; autoharness manages proposals, iterations, campaigns, and champion state under .autoharness/.
Install
Fastest setup with Codex or Claude:
pipx install "git+https://github.com/kayba-ai/autoharness.git"cdinto your harness repo- open Codex or Claude Code in that repo
- tell the assistant:
Run autoharness guide --assistant codex --print-next-prompt, then use the generated onboarding packet to finish setup.
For Claude Code, swap --assistant codex for --assistant claude.
Otherwise:
pipx install "git+https://github.com/kayba-ai/autoharness.git"
autoharness --help
If you do not use pipx:
python3 -m pip install --user "git+https://github.com/kayba-ai/autoharness.git"
How It Works
- guide inspects a repo, asks a few focused setup questions in a TTY, stays scriptable with flags in non-interactive use, writes a starter autoharness.yaml plus benchmark config, and runs a readiness check.
- doctor reruns config, generator, and benchmark validation when you want an explicit readiness gate.
- setup and init remain available when you want to manage bootstrap explicitly.
- run-benchmark executes one benchmark directly.
- generate-proposal previews one candidate change without running it.
- run-iteration or optimize executes one candidate or a resumable search loop.
- promote or promote-from-compare moves a winner into champion state.
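Taken together, a typical pass through these commands looks roughly like the sketch below. This is an illustrative sequence, not a prescription: depending on your config, individual commands may need extra flags or an explicit candidate to act on.
autoharness guide              # write a starter autoharness.yaml and run a readiness check
autoharness run-benchmark      # score the current harness once
autoharness generate-proposal  # preview one candidate change without running it
autoharness optimize           # run the resumable search loop over candidates
autoharness report             # review benchmark results
autoharness promote            # move the winner into champion state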
Mental Model
- target root: the harness repo or deployment tree to edit
- benchmark config: the command or adapter config that scores candidates
- workspace: the long-lived optimization effort
- track: one comparable lane inside a workspace
- campaign: a resumable search run over candidate proposals
- .autoharness/: persisted settings, proposals, records, iterations, and champions
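For illustration only, the persisted state could be pictured along these lines; the subdirectory names below are assumptions inferred from the list above, not documented paths.
.autoharness/
  settings      # persisted project settings (assumed name)
  proposals/    # generated candidate proposals (assumed name)
  records/      # benchmark records per candidate (assumed name)
  iterations/   # iteration and campaign state (assumed name)
  champions/    # promoted champion state (assumed name)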
Batteries Included
- Adapters: generic_command, pytest, harbor, tau2_bench, hal, car_bench
- Proposal generators: manual, failure_summary, local_template, local_command, openai_responses, codex_cli, claude_code
- Extension model: Python plugins can add generators, preflight checks, and search strategies from .autoharness/plugins/ or AUTOHARNESS_PLUGIN_PATHS
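For example, plugins kept outside the repo can be exposed through the environment variable named above before invoking any command; the path shown is a placeholder.
# load plugins from a custom location in addition to .autoharness/plugins/ (placeholder path)
export AUTOHARNESS_PLUGIN_PATHS=/opt/autoharness-plugins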
Quick Start
Let autoharness generate a starter project config:
autoharness guide
In a TTY, guide asks a few setup questions. In scripts or CI, use flags like --non-interactive, --benchmark-command, --generator, and --autonomy.
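For instance, a scripted or CI run might pin the setup as follows; the benchmark command is a placeholder, and failure_summary is one of the built-in generators listed under Batteries Included.
autoharness guide --non-interactive --benchmark-command "pytest -q" --generator failure_summary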
If you want Codex or Claude to help you refine the setup, generate an assistant brief too:
autoharness guide --assistant codex --print-next-prompt
# or
autoharness guide --assistant claude --print-next-prompt
This writes autoharness.codex.md or autoharness.claude.md plus a structured autoharness.onboarding.json handoff next to autoharness.yaml, then prints a ready-to-paste assistant prompt. Assistant wrapper prompts live under contrib/agents/.
guide ends with a doctor pass. Run autoharness doctor again later if you want an explicit re-check or a repeated benchmark probe.
On a fresh install, guide prefers a local assistant backend when codex or claude is installed, otherwise uses openai_responses when OpenAI credentials are configured, and falls back to failure_summary only when no model-backed generator is available.
Then run the benchmark directly:
autoharness run-benchmark
If autoharness.yaml is present, autoharness will auto-bootstrap missing settings and workspace state on this common path. setup and init are still available when you want explicit control.
Generate a proposal against a target harness root:
autoharness generate-proposal
If you switch the project config to openai_responses, export an API key first:
export OPENAI_API_KEY=...
Run the outer loop:
autoharness optimize
autoharness report
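When report shows a candidate worth keeping, promote (or promote-from-compare) moves it into champion state. The bare invocation below is a sketch; the command may require selecting a specific candidate or comparison.
autoharness promote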
Early Results
Example from one tau2 airline benchmark study. Relative deltas are measured against the baseline harness on the same workload. Results depend on the benchmark, harness, and evaluation setup, and some intervention combinations can regress.
Docs
For Power Users
- Background campaign workers plus queue and worker-state inspection
- Root-level memory, transfer suggestions, and portfolio scheduling
- Retention policies, pruning, and portable report and bundle exports
- Event logs, inspection commands, and operational reporting surfaces
- Python plugin hooks for generators, preflight checks, and search strategies
Want deeper analysis or a custom optimization workflow? Kayba offers managed harness optimization and agent-improvement support tailored to your stack.
Star this repo if you find it useful!
Built with ❤️ by Kayba and the open-source community.
Similar articles
Claude Code improved my agent harness by 40% overnight
The author introduces "Autoharness," a tool that uses Claude Code to autonomously optimize an agent harness by iterating on prompts and hyperparameters. On the tau2-airline benchmark, it improved performance by 40%.
Harness design for long-running application development
Anthropic engineers describe a multi-agent harness design that uses generator and evaluator agents to improve Claude's ability to autonomously build complete, high-quality frontend applications over long horizons.
@astaxie: Today the group discussed how to learn about harnesses. For harness engineering I'm studying these two: 1. https://github.com/walkinglabs/learn-harness-engineering… Through it, you get to know each harness's…
A project-based course repository on Harness Engineering for AI coding agents, covering environment setup, state management, verification, and control mechanisms to make AI coding agents work reliably. The course synthesizes best practices from OpenAI and Anthropic on building effective harnesses for long-running agents.
Your framework is failing your agents, but there is no benchmark to prove it
The article highlights the lack of benchmarks for evaluating the reliability of agent frameworks, focusing on how MCP implementations handle tool calls and errors better than the models themselves do.
@teach_fireworks: AI coding is now entering a very interesting phase. In the past, the discussion was mostly about model capability, context length, agent loops, tool use, and automated programming, but once agents are actually placed in real development environments for long periods, many teams find the problem is no longer just whether they can generate co…
Introduces re_gent, an open-source tool that provides runtime-level version control and observability infrastructure for AI coding agents, addressing code provenance and auditing after agents have been running for a long time.