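@geekbb: An automated agent harness optimization tool that takes over the grunt work of harness tuning. You give it a benchmark command and a target repo, and it generates proposals, runs evals, records results, keeps the good candidates, discards the bad ones, and automatically improves the agent's prompt, config, and source code. https…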

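Summary

autoharness is an automated agent harness optimization tool. Given a benchmark command, it generates proposals, runs evaluations, and improves the agent's prompt, config, and source code. It supports Codex and Claude.

An automated agent harness optimization tool that takes over the grunt work of harness tuning. You give it a benchmark command and a target repo, and it generates proposals, runs evals, records results, keeps the good candidates, discards the bad ones, and automatically improves the agent's prompt, config, and source code. https://t.co/2qhYImGjuP https://t.co/t9qGZMZjkP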

Cached: 2026/05/11 12:42


kayba-ai/autoharness

Source: https://github.com/kayba-ai/autoharness

MIT license · Python 3.11+ · Codex supported · Claude Code supported

autoharness

Let autoharness run overnight and come back to an optimized agent harness, so your production agents never make mistakes again.

autoharness improves agent harnesses by proposing or applying prompt, config, middleware, and source changes, running evals, and keeping or discarding candidates based on benchmark results.

It is a control plane for an existing harness repo. You point it at a target root and a benchmark command; autoharness manages proposals, iterations, campaigns, and champion state under .autoharness/.
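The keep-or-discard idea described above can be illustrated with a small greedy loop. This is only a sketch of the general technique, not autoharness's actual implementation: the `optimize`, `propose`, and `score` names are hypothetical, and a real run would score candidates with your benchmark command rather than a Python callable.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for whatever a candidate changes (prompt, config, source).
Harness = Dict[str, float]


def optimize(
    champion: Harness,
    propose: Callable[[Harness], Harness],
    score: Callable[[Harness], float],
    iterations: int,
) -> Tuple[Harness, float, List[dict]]:
    """Greedy search: a candidate replaces the champion only if it scores higher."""
    best = score(champion)  # baseline measurement
    records: List[dict] = []
    for i in range(iterations):
        candidate = propose(champion)  # always derive from the current champion
        result = score(candidate)
        kept = result > best
        records.append({"iteration": i, "score": result, "kept": kept})
        if kept:
            champion, best = candidate, result
    return champion, best, records
```

Every candidate is recorded whether or not it is kept, which mirrors the "record results, keep the good, discard the bad" workflow at a very small scale.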

Install

Fastest setup with Codex or Claude:

  1. pipx install "git+https://github.com/kayba-ai/autoharness.git"
  2. cd into your harness repo
  3. open Codex or Claude Code in that repo
  4. tell the assistant: Run autoharness guide --assistant codex --print-next-prompt, then use the generated onboarding packet to finish setup.

For Claude Code, swap --assistant codex for --assistant claude.

Otherwise, install it manually and check that the CLI is available:

pipx install "git+https://github.com/kayba-ai/autoharness.git"
autoharness --help

If you do not use pipx:

python3 -m pip install --user "git+https://github.com/kayba-ai/autoharness.git"

How It Works

  • guide inspects a repo, asks a few focused setup questions in a TTY, stays scriptable with flags in non-interactive use, writes a starter autoharness.yaml plus benchmark config, and runs a readiness check.
  • doctor reruns config, generator, and benchmark validation when you want an explicit readiness gate.
  • setup and init remain available when you want to manage bootstrap explicitly.
  • run-benchmark executes one benchmark directly.
  • generate-proposal previews one candidate change without running it.
  • run-iteration or optimize executes one candidate or a resumable search loop.
  • promote or promote-from-compare moves a winner into champion state.
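If you want to chain the subcommands listed above from a script, one possible shape is a small driver that stops at the first failure. This is a sketch under assumptions: it uses only subcommand names that appear in this README, it assumes `guide` has already been run, and it assumes autoharness follows the usual CLI convention of a non-zero exit code on failure. `QUICK_START_STEPS` and `run_steps` are hypothetical names.

```python
import subprocess
from typing import Callable, List, Sequence

# Subcommands named in the README, in roughly the order the Quick Start uses them.
# `guide` is left out because it may prompt interactively in a TTY.
QUICK_START_STEPS: List[List[str]] = [
    ["autoharness", "doctor"],
    ["autoharness", "run-benchmark"],
    ["autoharness", "generate-proposal"],
    ["autoharness", "optimize"],
    ["autoharness", "report"],
]


def _run(cmd: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_steps(
    steps: Sequence[Sequence[str]],
    runner: Callable[[Sequence[str]], int] = _run,
) -> List[str]:
    """Run steps in order, stop at the first non-zero exit, return the steps that succeeded."""
    completed: List[str] = []
    for cmd in steps:
        if runner(cmd) != 0:  # assumption: non-zero exit means the step failed
            break
        completed.append(" ".join(cmd))
    return completed
```

Calling `run_steps(QUICK_START_STEPS)` inside a configured harness repo would execute the real commands; passing a custom `runner` lets you dry-run the sequence first.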

Mental Model

  • target root: the harness repo or deployment tree to edit
  • benchmark config: the command or adapter config that scores candidates
  • workspace: the long-lived optimization effort
  • track: one comparable lane inside a workspace
  • campaign: a resumable search run over candidate proposals
  • .autoharness/: persisted settings, proposals, records, iterations, and champions
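To make the relationships between these terms concrete, here is a toy data model. It is purely illustrative: the class and field names are hypothetical, the real on-disk layout under .autoharness/ is not described in this README, and storing one champion per track is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Campaign:
    """A resumable search run; here it only remembers candidate scores."""
    name: str
    candidate_scores: List[float] = field(default_factory=list)


@dataclass
class Track:
    """One comparable lane inside a workspace (assumed to hold one champion)."""
    name: str
    champion_score: Optional[float] = None
    campaigns: List[Campaign] = field(default_factory=list)

    def promote(self, score: float) -> bool:
        """Store score as the new champion if it beats the current one."""
        if self.champion_score is None or score > self.champion_score:
            self.champion_score = score
            return True
        return False


@dataclass
class Workspace:
    """The long-lived optimization effort for one target root and benchmark."""
    name: str
    target_root: str
    benchmark_command: str
    tracks: Dict[str, Track] = field(default_factory=dict)
```

Read top-down, a workspace owns tracks, a track owns campaigns and a champion, and a campaign accumulates scored candidates.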

Batteries Included

  • Adapters: generic_command, pytest, harbor, tau2_bench, hal, car_bench
  • Proposal generators: manual, failure_summary, local_template, local_command, openai_responses, codex_cli, claude_code
  • Extension model: Python plugins can add generators, preflight checks, and search strategies from .autoharness/plugins/ or AUTOHARNESS_PLUGIN_PATHS
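The two plugin locations named above could be combined into a search path along these lines. This is a guess at the behavior, not the tool's documented logic: the README does not say how AUTOHARNESS_PLUGIN_PATHS is delimited or which location takes precedence, so the use of `os.pathsep` and the ordering below are assumptions, and `plugin_search_dirs` is a hypothetical helper.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional


def plugin_search_dirs(
    target_root: str,
    env: Optional[Mapping[str, str]] = None,
) -> List[Path]:
    """Return candidate plugin directories: the repo-local one first, then env entries."""
    env = os.environ if env is None else env
    dirs = [Path(target_root) / ".autoharness" / "plugins"]
    raw = env.get("AUTOHARNESS_PLUGIN_PATHS", "")
    # Assumption: entries are separated like PATH (':' on POSIX, ';' on Windows).
    dirs.extend(Path(entry) for entry in raw.split(os.pathsep) if entry)
    return dirs
```

The function only computes paths and does not check that they exist or import anything from them.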

Quick Start

Let autoharness generate a starter project config:

autoharness guide

In a TTY, guide asks a few setup questions. In scripts or CI, use flags like --non-interactive, --benchmark-command, --generator, and --autonomy.

If you want Codex or Claude to help you refine the setup, generate an assistant brief too:

autoharness guide --assistant codex --print-next-prompt
# or
autoharness guide --assistant claude --print-next-prompt

This writes autoharness.codex.md or autoharness.claude.md plus a structured autoharness.onboarding.json handoff next to autoharness.yaml, then prints a ready-to-paste assistant prompt. Assistant wrapper prompts live under contrib/agents/.

guide ends with a doctor pass. Run autoharness doctor again later if you want an explicit re-check or a repeated benchmark probe.

On a fresh install, guide prefers a local assistant backend when codex or claude is installed, otherwise uses openai_responses when OpenAI credentials are configured, and falls back to failure_summary only when no model-backed generator is available.
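The fallback order in the previous paragraph can be written out as a short decision function. This is a sketch, not the tool's source: it assumes "installed" means the `codex` or `claude` executable is on PATH, that "OpenAI credentials" means OPENAI_API_KEY is set, and that Codex is checked before Claude when both are present. The README does not state any of these details, and `pick_default_generator` is a hypothetical name.

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def pick_default_generator(
    which: Callable[[str], Optional[str]] = shutil.which,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Pick a proposal generator name following the fallback order described above."""
    if which("codex"):  # assumption: Codex is preferred when both CLIs exist
        return "codex_cli"
    if which("claude"):
        return "claude_code"
    if env.get("OPENAI_API_KEY"):  # assumption: this is the credential checked
        return "openai_responses"
    return "failure_summary"  # last resort: no model-backed generator available
```

Injecting `which` and `env` keeps the function easy to exercise without touching the real machine state.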

Then run the benchmark directly:

autoharness run-benchmark

If autoharness.yaml is present, autoharness auto-bootstraps any missing settings and workspace state when you run these common commands. setup and init are still available when you want explicit control.

Generate a proposal against a target harness root:

autoharness generate-proposal

If you switch the project config to openai_responses, export an API key first:

export OPENAI_API_KEY=...

Run the outer loop:

autoharness optimize
autoharness report

Early Results

Example from one tau2 airline benchmark study. Relative deltas are measured against the baseline harness on the same workload. Results depend on the benchmark, harness, and evaluation setup, and some intervention combinations can regress.

Figure: tau2 intervention deltas
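A relative delta against a baseline is commonly computed as the change divided by the baseline score. The README does not give the exact formula used in the study, so treat the definition below as an assumption; `relative_delta` is a hypothetical helper.

```python
def relative_delta(baseline: float, candidate: float) -> float:
    """Fractional change of candidate relative to baseline (0.10 means +10%)."""
    if baseline == 0:
        raise ValueError("relative delta is undefined for a zero baseline")
    return (candidate - baseline) / baseline
```

With made-up numbers that are not from the study, a baseline score of 0.50 and a candidate score of 0.55 give a relative delta of about +10%, while a candidate at 0.45 gives about -10%, which is the kind of regression the paragraph above warns about.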

Docs

For Power Users

  • Background campaign workers plus queue and worker-state inspection
  • Root-level memory, transfer suggestions, and portfolio scheduling
  • Retention policies, pruning, and portable report and bundle exports
  • Event logs, inspection commands, and operational reporting surfaces
  • Python plugin hooks for generators, preflight checks, and search strategies

Want deeper analysis or a custom optimization workflow? Kayba offers managed harness optimization and agent-improvement support tailored to your stack.

Star this repo if you find it useful!

Built with ❤️ by Kayba and the open-source community.
