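@krystal_ning: Thanks for sharing our survey! We also maintain an Awesome Code as Agent Harness Papers repository that collects recent work on…
Summary
Krystal Ning shares a curated Awesome-list repository of papers on code-centric agent systems and harness engineering. The list accompanies a survey titled "Code as Agent Harness".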
Cached: 2026/05/20 06:25
Thanks for sharing our survey! We also maintain an Awesome Code as Agent Harness Papers repository, which collects recent research on code-centric agent systems and harness engineering: https://github.com/YennNing/Awesome-Code-as-Agent-Harness-Papers…
YennNing/Awesome-Code-as-Agent-Harness-Papers
Source: https://github.com/YennNing/Awesome-Code-as-Agent-Harness-Papers
Awesome Code as Agent Harness Papers
Awesome (https://awesome.re)
arXiv (https://arxiv.org/abs/2605.18747)
Official website (https://code-as-harness.github.io/code-as-harness-webpage/)
Hugging Face Paper of the Day #1 (https://huggingface.co/papers/2605.18747)
@_akhaliq (https://x.com/_akhaliq/status/2056900568921133565?s=20)
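This repository accompanies the survey paper "Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems" (https://arxiv.org/abs/2605.18747).
We study the emerging role of code in agentic AI: code is no longer only a generated artifact but is increasingly an executable, inspectable, and stateful harness through which agents reason, act, model environments, receive feedback, and coordinate. This repository organizes representative papers around three interconnected layers: Harness Interface, Harness Mechanisms, and Scaling the Harness, spanning coding assistants, GUI/OS automation, scientific discovery, and embodied AI.
👋 We welcome paper suggestions, pull requests, and collaborations related to code as agent harness. Please contact [email protected], [email protected], [email protected], [email protected], and [email protected]. We will keep updating this repository with the latest work on code-centric agent systems and harness engineering.
📚 If you find this resource useful, please cite our survey and star the repository (https://github.com/YennNing/Awesome-Code-as-Agent-Harness-Papers):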
@article{ning2026codeasharness,
  title = {Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems},
  author = {Ning, Xuying and Tieu, Katherine and Fu, Dongqi and Wei, Tianxin and Li, Zihao and Bei, Yuanchen and others},
  journal = {arXiv preprint arXiv:2605.18747},
  year = {2026}
}
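Framework overview figure
🔔 News
[2026-05] 🚀 Our survey "Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems" is now available on arXiv (https://arxiv.org/abs/2605.18747). Links to the slides and project page will be added once they are available.
📋 Table of Contents
🧩 Harness Interface
Code serves as the basic interface between the model and the task environment. Programs turn model outputs into executable, inspectable, and stateful structures: code makes reasoning executable, actions programmable, and environment state inspectable.
Harness Interface diagram
💭 Code for Reasoning
Programs externalize internal logic as verifiable computation, allowing interpreters, symbolic solvers, execution traces, or process rewards to check and refine intermediate steps.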
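As a rough illustration of reasoning externalized as a program, the sketch below runs a hard-coded string that stands in for model output, then checks the result with an independent recomputation. Everything in it (`GENERATED_PROGRAM`, `run_program`, `check_answer`, and the interest problem itself) is a hypothetical example, not taken from any paper listed here.

```python
import math

# Hypothetical model output: in a real system this string would come from an LLM.
GENERATED_PROGRAM = """
principal = 1000
rate = 0.05
years = 3
answer = principal * (1 + rate) ** years
"""

def run_program(source: str) -> dict:
    """Execute generated code in a fresh namespace and return its variables."""
    # Empty builtins keep the example small; this is NOT a real security sandbox.
    namespace: dict = {"__builtins__": {}}
    exec(source, namespace)
    namespace.pop("__builtins__", None)
    return namespace

def check_answer(state: dict) -> bool:
    """Independent check: recompute the result by adding interest year by year."""
    value = state["principal"]
    for _ in range(state["years"]):
        value += value * state["rate"]
    return math.isclose(value, state["answer"])

state = run_program(GENERATED_PROGRAM)
print(state["answer"], check_answer(state))
```

The interpreter produces the number, and the checker only has to confirm it, which is the division of labor the paragraph above describes.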
Program-Delegated Reasoning
| Paper | Venue |
|---|---|
| Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks (https://arxiv.org/abs/2211.12588) | TMLR 2023 |
| MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning (https://arxiv.org/abs/2310.03731) | ICLR 2024 |
| Chain of Code: Reasoning with a Language Model-Augmented Code Emulator (https://arxiv.org/abs/2312.04474) | ICML 2024 |
| Method-Based Reasoning for Large Language Models: Extraction, Reuse, and Continuous Improvement (https://arxiv.org/abs/2508.04289) | arXiv 2025 |
| Code-Enabled Language Models Can Outperform Reasoning Models on Diverse Tasks (https://arxiv.org/abs/2510.20909) | arXiv 2025 |
| When Do Program-of-Thoughts Work for Reasoning? (https://ojs.aaai.org/index.php/AAAI/article/view/29721) | AAAI 2024 |
| PAL: Program-aided Language Models (https://proceedings.mlr.press/v202/gao23f.html) | ICML 2023 |
| Show Your Work: Scratchpads for Intermediate Computation with Language Models (https://arxiv.org/abs/2112.00114) | arXiv 2021 |
| Reasoning Like Program Executors (https://aclanthology.org/2022.emnlp-main.48/) | EMNLP 2022 |
| Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments (https://aclanthology.org/2025.findings-acl.817/) | ACL 2025 Findings |
| Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (https://openreview.net/forum?id=_VjQlMeSB_J) | NeurIPS 2022 |
Hybrid Symbolic-Neural Execution
| Paper | Venue |
|---|---|
| Self-Verifying Reflection Helps Transformers with CoT Reasoning (https://neurips.cc/virtual/2025/poster/119948) | NeurIPS 2025 |
| SSR: Socratic Self-Refine for Large Language Model Reasoning (https://arxiv.org/abs/2511.10621) | arXiv 2025 |
| CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance (https://arxiv.org/abs/2502.04350) | ICML 2025 |
| Graph of Thoughts: Solving Elaborate Problems with Large Language Models (https://ojs.aaai.org/index.php/AAAI/article/view/29720) | AAAI 2024 |
| Code-as-Symbolic-Planner: Foundation Model-Based Robot Planning via Symbolic Code Generation (https://arxiv.org/abs/2503.01700) | IROS 2025 |
Iterative Code-Grounded Reasoning
| Paper | Venue |
|---|---|
| NExT: Teaching Large Language Models to Reason about Code Execution (https://arxiv.org/abs/2404.14662) | ICML 2024 |
| What I cannot execute, I do not understand: Training and Evaluating LLMs on Program Execution Traces (https://arxiv.org/abs/2503.05703) | arXiv 2025 |
| Reasoning Through Execution: Unifying Process and Outcome Rewards for Code Generation (https://arxiv.org/abs/2412.15118) | ICML 2025 |
| CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment (https://arxiv.org/abs/2510.18471) | arXiv 2025 |
| RLTF: Reinforcement Learning from Unit Test Feedback (https://arxiv.org/abs/2307.04349) | TMLR 2023 |
| RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning (https://arxiv.org/abs/2410.02089) | ICML 2025 |
| Execution guided line-by-line code generation (https://openreview.net/forum?id=ySFDPoiANu) | NeurIPS 2025 |
| R1-Code-Interpreter: LLMs Reason with Code via Supervised and Multi-stage Reinforcement Learning (https://arxiv.org/abs/2505.21668) | arXiv 2025 |
| CYCLE: Learning to Self-Refine the Code Generation (https://dl.acm.org/doi/full/10.1145/3649825) | OOPSLA 2024 |
| StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback (https://aclanthology.org/2024.acl-long.251/) | ACL 2024 |
| CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning (https://openreview.net/forum?id=WaGvb7OzySA) | NeurIPS 2022 |
| CodePRM: Execution Feedback-enhanced Process Reward Model for Code Generation (https://aclanthology.org/2025.findings-acl.428/) | ACL 2025 Findings |
| SatLM: Satisfiability-Aided Language Models Using Declarative Prompting (https://openreview.net/forum?id=8tt9KxyV2s) | NeurIPS 2023 |
| Self-Edit: Fault-Aware Code Editor for Code Generation (https://aclanthology.org/2023.acl-long.45/) | ACL 2023 |
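🤖 Code for Acting
Generated programs serve as policies, tool calls, behavior trees, or reusable skills in embodied, GUI, software, and tool-use environments.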
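One way to picture a generated program acting as a policy is a harness that exposes a small skill library and executes model-written code against it. The sketch below assumes a toy pick-and-place setting. The skill names, the `GENERATED_POLICY` string, and the `run_policy` helper are all hypothetical and do not come from any system in this list.

```python
from typing import Callable, Dict, List

# Hypothetical skill library exposed by the harness.
SKILLS: Dict[str, Callable[..., str]] = {}
LOG: List[str] = []  # records every skill call so the run can be inspected

def skill(fn: Callable[..., str]) -> Callable[..., str]:
    """Register a function as a skill that generated policies may call."""
    SKILLS[fn.__name__] = fn
    return fn

@skill
def pick(obj: str) -> str:
    LOG.append(f"pick({obj})")
    return obj

@skill
def place(obj: str, target: str) -> str:
    LOG.append(f"place({obj}, {target})")
    return target

# Stand-in for a model-generated policy; a real system would get this from an LLM.
GENERATED_POLICY = """
for item in ["red_block", "blue_block"]:
    place(pick(item), "tray")
"""

def run_policy(source: str) -> List[str]:
    """Run the policy with only the registered skills in scope and return the call log."""
    LOG.clear()
    exec(source, {"__builtins__": {}, **SKILLS})
    return list(LOG)

trace = run_policy(GENERATED_POLICY)
print(trace)
```

Because the policy can only see registered skills, calling anything else raises a `NameError`, which gives the harness a simple, if incomplete, way to bound what an action program can do.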
Grounded Skill Selection
| Paper | Venue |
|---|---|
| Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (https://arxiv.org/abs/2204.01691) | CoRL 2022 |
| Robots That Ask for Help: Uncertainty Alignment for Large Language Model Planners (https://arxiv.org/abs/2307.01928) | CoRL 2023 |
| Bootstrap Your Own Skills: Learning to Solve New Tasks with Large Language Model Guidance (https://arxiv.org/abs/2310.10021) | CoRL 2023 |
| SkillVLA: Tackling Combinatorial Diversity in Dual-Arm Manipulation via Skill Reuse (https://arxiv.org/abs/2603.03836) | arXiv 2026 |
| Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition (https://proceedings.mlr.press/v229/ha23a.html) | CoRL 2023 |
| Lifelong Robot Library Learning: Bootstrapping Composable and Generalizable Skills for Embodied Control with Language Models (https://ieeexplore.ieee.org/document/10611448/) | ICRA 2024 |
Programmatic Policy Generation
| Paper | Venue |
|---|---|
| RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis (https://arxiv.org/abs/2402.16117) | ICML 2024 |
| CP-Agent: Agentic Constraint Programming (https://arxiv.org/abs/2508.07468) | arXiv 2025 |
| LLM-Driven Corrective Robot Operation Code Generation with Static Text-Based Simulation (https://arxiv.org/abs/2512.02002) | ICRA 2026 |
| NormCode: A Semi-Formal Language for Auditable AI Planning (https://arxiv.org/abs/2512.10563) | arXiv 2025 |
| ALRM: Agentic LLM for Robotic Manipulation (https://arxiv.org/abs/2601.19510) | arXiv 2026 |
| RACAS: Controlling Diverse Robots With a Single Agentic System (https://arxiv.org/abs/2603.05621) | arXiv 2026 |
| ReAct: Synergizing Reasoning and Acting in Language Models (https://openreview.net/forum?id=WE_vluYUL-X) | ICLR 2023 |
| GenSwarm: Scalable Multi-Robot Code-Policy Generation and Deployment via Language Models (https://www.nature.com/articles/s44182-025-00065-w) | npj Robotics 2026 |
| Code as Policies: Language Model Programs for Embodied Control (https://ieeexplore.ieee.org/document/10160591/) | ICRA 2023 |
| Robotic Programmer: Video Instructed Policy Code Generation for Robotic Manipulation (https://arxiv.org/abs/2501.04268) | arXiv 2025 |
| Code-BT: A Code-Driven Approach to Behavior Tree Generation for Robot Tasks Planning with Large Language Models (https://www.ijcai.org/proceedings/2025/980) | IJCAI 2025 |
Lifelong Code-Based Agents
| Paper | Venue |
|---|---|
| Growing with Your Embodied Agent: A Human-in-the-Loop Lifelong Code Generation Framework for Long-Horizon Manipulation Skills (https://arxiv.org/abs/2509.18597) | arXiv 2025 |
| ViReSkill: Vision-Grounded Replanning with Skill Memory for LLM-Based Planning in Lifelong Robot Learning (https://arxiv.org/abs/2509.24219) | arXiv 2025 |
| UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience (https://arxiv.org/abs/2603.24533) | arXiv 2026 |
| Voyager: An Open-Ended Embodied Agent with Large Language Models (https://openreview.net/forum?id=ehfRiF0R3a) | TMLR 2023 |
| Lifelong Language-Conditioned Robotic Manipulation Learning (https://arxiv.org/abs/2603.05160) | arXiv 2026 |
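🌍 Code for Environment Modeling
Program states, repositories, traces, simulators, and tests represent the state, dynamics, and feedback signals that agents interact with.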
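A minimal sketch of code as an environment model, assuming a made-up grid world: a candidate transition function (standing in for model-written code) is checked against observed traces, and any mismatch becomes a feedback signal. The traces, `candidate_model`, and `mismatches` are illustrative assumptions rather than the method of any listed paper.

```python
from typing import Callable, List, Tuple

State = Tuple[int, int]                # (x, y) position on a hypothetical grid
Transition = Tuple[State, str, State]  # (state, action, next_state)

# Observed interaction traces (made-up data for illustration).
TRACES: List[Transition] = [
    ((0, 0), "right", (1, 0)),
    ((1, 0), "up", (1, 1)),
    ((1, 1), "left", (0, 1)),
]

def candidate_model(state: State, action: str) -> State:
    """A candidate transition function, standing in for model-written code."""
    x, y = state
    moves = {"right": (1, 0), "left": (-1, 0), "up": (0, 1), "down": (0, -1)}
    dx, dy = moves[action]
    return (x + dx, y + dy)

def mismatches(model: Callable[[State, str], State],
               traces: List[Transition]) -> List[Transition]:
    """Return the observed transitions that the model fails to reproduce."""
    return [(s, a, s2) for s, a, s2 in traces if model(s, a) != s2]

failures = mismatches(candidate_model, TRACES)
print(failures)
```

An empty list means the program explains every observed transition. A non-empty list identifies exactly which observations the model would need to be revised to cover.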
Structured World Representations
| Paper | Venue |
|---|---|
| From Programs to Poses: Factored Real-World Scene Generation via Learned Program Libraries (https://openreview.net/forum?id=Ew8bJkSt3g) | NeurIPS 2025 |
| PoE-World: Compositional World Modeling with Products of Programmatic Experts (https://openreview.net/forum?id=obwRcksFZw) | NeurIPS 2025 |
| Code2World: A GUI World Model via Renderable Code Generation (https://arxiv.org/abs/2602.09856) | arXiv 2026 |
| Code2Worlds: Empowering Coding LLMs for 4D World Generation (https://arxiv.org/abs/2602.11757) | arXiv 2026 |
| ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided Code-Vision Representation (https://aclanthology.org/2023.emnlp-main.824/) | EMNLP 2023 |
Execution-Trace World Modeling
| Paper | Venue |
|---|---|
| SemCoder: Training Code Language Models with Comprehensive Semantics Reasoning (https://arxiv.org/abs/2406.01006) | NeurIPS 2024 |
| CWM: An Open-Weights LLM for Research on Code Generation with World Models (https://arxiv.org/abs/2510.02387) | arXiv 2025 |
| Reinforcement World Model Learning for LLM-based Agents (https://arxiv.org/abs/2602.05842) | arXiv 2026 |
| Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning (https://arxiv.org/abs/2602.10090) | arXiv 2026 |
| Aligning Agentic World Models via Knowledgeable Experience Learning (https://arxiv.org/abs/2601.13247) | arXiv 2026 |
| WorldCoder, a Model-Based LLM Agent: Building World Models by Writing Code and Interacting with the Environment (https://proceedings.neurips.cc/paper_files/paper/2024/file/820c61a0cd419163ccbd2c33b268816e-Paper-Conference.pdf) | NeurIPS 2024 |
Code-Grounded Evaluation Environments
| Paper | Venue |
|---|---|
| CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution (https://arxiv.org/abs/2401.03065) | ICML 2024 |
| LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code (https://openreview.net/forum?id=chfJJYC3iL) | ICLR 2025 |
| SWE-bench: Can Language Models Resolve Real-world Github Issues? (https://arxiv.org/abs/2310.06770) | ICLR 2024 |
| AgentBench: Evaluating LLMs as Agents (https://arxiv.org/abs/2308.03688) | ICLR 2024 |
| CoRe: Benchmarking LLMs’ Code Reasoning Capabilities through Static Analysis Tasks (https://neurips.cc/virtual/2025/poster/121601) | NeurIPS 2025 |
| GeoGramBench: Benchmarking the Geometric Program Reasoning in Modern LLMs (https://arxiv.org/abs/2505.17653) | arXiv 2025 |
| CodeGlance: Understanding Code Reasoning Challenges in LLMs through Multi-Dimensional Feature Analysis (https://arxiv.org/abs/2602.13962) | arXiv 2026 |
| Endless Terminals: Scaling RL Environments for Terminal Agents (https://arxiv.org/abs/2601.16443) | arXiv 2026 |
| Reflexion: Language Agents with Verbal Reinforcement Learning (https://openreview.net/forum?id=vAElhFcKW6) | NeurIPS 2023 |
| CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution (https://aclanthology.org/2025.acl-long.1158/) | ACL 2025 |
| InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback (https://proceedings.neurips.cc/paper_files/paper/2023/hash/4b175d846fb008d540d233c188379ff9-Abstract-Datasets_and_Benchmarks.html) | NeurIPS 2023 |
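🛠️ Harness Mechanisms
Once code is placed in the agent loop, the harness must decide what to execute next, retain useful state, expose the right tools, and turn failures into corrective actions.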
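The loop described above can be sketched roughly as follows, under strong simplifying assumptions: the "model" is a stub that returns a buggy draft first and a fix once feedback exists, and the only tool is a single fixed test. `HarnessState`, `run_tests`, `stub_proposer`, and `harness_loop` are hypothetical names introduced for this illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class HarnessState:
    """State the harness keeps across iterations."""
    attempts: List[str] = field(default_factory=list)
    feedback: List[str] = field(default_factory=list)

def run_tests(source: str) -> str:
    """Execute candidate code plus one fixed test; return '' on success, else an error message."""
    namespace: dict = {}
    try:
        exec(source, namespace)
        assert namespace["add"](2, 3) == 5, "add(2, 3) should be 5"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return ""

def stub_proposer(state: HarnessState) -> str:
    """Hypothetical stand-in for an LLM: a buggy draft first, a fix once feedback exists."""
    if not state.feedback:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"

def harness_loop(propose: Callable[[HarnessState], str], max_steps: int = 3) -> HarnessState:
    """Propose, execute, and feed failures back until the test passes or steps run out."""
    state = HarnessState()
    for _ in range(max_steps):
        candidate = propose(state)
        state.attempts.append(candidate)
        error = run_tests(candidate)
        if not error:
            break
        state.feedback.append(error)  # the failure becomes input to the next proposal
    return state

final = harness_loop(stub_proposer)
print(len(final.attempts), final.feedback)
```

The design choice worth noting is that the failure message is stored in state rather than discarded, so the next proposal can condition on it.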
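Harness Mechanisms diagram
🗺️ Planning for Code Agents
Planning is the control layer of the harness: it structures how the agent externalizes intent into executable steps, schedules interactions with code artifacts and tools, and regulates the trajectory.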