@krystal_ning: Thanks for sharing our survey! We also maintain an Awesome Code as Agent Harness Papers repository that collects recent work on…

X AI KOLs Following · Tools

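Summary

Krystal Ning shares a curated Awesome-list repository of papers on code-centric agent systems and harness engineering. The list accompanies a survey titled "Code as Agent Harness".

Thanks for sharing our survey! We also maintain an Awesome Code as Agent Harness Papers repository that collects recent work on code-centric agent systems and harness engineering: https://github.com/YennNing/Awesome-Code-as-Agent-Harness-Papers…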

Cached: 2026/05/20 06:25


YennNing/Awesome-Code-as-Agent-Harness-Papers

Source: https://github.com/YennNing/Awesome-Code-as-Agent-Harness-Papers

Awesome Code as Agent Harness Papers

Awesome (https://awesome.re)
arXiv (https://arxiv.org/abs/2605.18747)
Website (https://code-as-harness.github.io/code-as-harness-webpage/)
Hugging Face #1 Paper of the Day (https://huggingface.co/papers/2605.18747)
@_akhaliq (https://x.com/_akhaliq/status/2056900568921133565?s=20)

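This repository accompanies the survey paper "Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems" (https://arxiv.org/abs/2605.18747).
We study the emerging role of code in agentic AI: code is no longer only a generated artifact but increasingly an executable, inspectable, and stateful harness through which agents reason, act, model their environment, receive feedback, and coordinate. The repository organizes representative papers around three interconnected layers, Harness Interface, Harness Mechanisms, and Scaling the Harness, spanning coding assistants, GUI/OS automation, scientific discovery, and embodied intelligence.

👋 We welcome paper suggestions, pull requests, and collaborations related to code as agent harness. Please reach out by email; the contact addresses are obscured in this copy ([email protected]). We will keep updating this repository with the latest work on code-centric agent systems and harness engineering.

📚 If you find this resource useful, please cite it and star the repository (https://github.com/YennNing/Awesome-Code-as-Agent-Harness-Papers):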

@article{ning2026codeasharness,
  title   = {Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems},
  author  = {Ning, Xuying and Tieu, Katherine and Fu, Dongqi and Wei, Tianxin and Li, Zihao and Bei, Yuanchen and others},
  journal = {arXiv preprint arXiv:2605.18747},
  year    = {2026}
}

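Figure: Framework overview

🔔 News

[2026-05] 🚀 Our survey "Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems" is now available on arXiv (https://arxiv.org/abs/2605.18747). Links to the slides and project page will be added once they are available.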


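🧩 Harness Interface

Code serves as the basic interface between the model and the task environment. Programs turn model outputs into executable, inspectable, and stateful structures: code makes reasoning executable, actions programmable, and environment state inspectable.

Figure: Harness interface

💭 Code for Reasoning

Programs externalize internal logic as verifiable computation, so that interpreters, symbolic solvers, execution traces, or process rewards can check and refine intermediate steps.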
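To make "reasoning as verifiable computation" concrete, here is a minimal sketch in the spirit of program-delegated reasoning: a model writes a short program and an interpreter, not the model, produces the answer. The `generated` string is a hypothetical model output and `run_program` is an illustrative helper, not an API from any of the listed papers.

```python
def run_program(source: str) -> dict:
    """Execute model-written code in a fresh namespace and return its variables.

    Emptying __builtins__ only keeps the example small; it is NOT a security
    sandbox, and a real harness would isolate execution properly.
    """
    namespace: dict = {}
    exec(source, {"__builtins__": {}}, namespace)
    return namespace


# Hypothetical model output for the question:
# "A shop has 3 boxes of 12 pens and 5 loose pens. How many pens in total?"
generated = (
    "boxes = 3\n"
    "per_box = 12\n"
    "loose = 5\n"
    "answer = boxes * per_box + loose\n"
)

state = run_program(generated)

# Every intermediate value is inspectable, so a checker can verify each step
# instead of trusting free-form text.
assert state["boxes"] * state["per_box"] == 36
print(state["answer"])  # 41
```

Because the arithmetic is delegated to the interpreter, a wrong answer can be traced to a specific variable rather than to an opaque chain of text.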

Program-Delegated Reasoning

Paper | Venue
Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks (https://arxiv.org/abs/2211.12588) | TMLR 2023
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning (https://arxiv.org/abs/2310.03731) | ICLR 2024
Chain of Code: Reasoning with a Language Model-Augmented Code Emulator (https://arxiv.org/abs/2312.04474) | ICML 2024
Method-Based Reasoning for Large Language Models: Extraction, Reuse, and Continuous Improvement (https://arxiv.org/abs/2508.04289) | arXiv 2025
Code-Enabled Language Models Can Outperform Reasoning Models on Diverse Tasks (https://arxiv.org/abs/2510.20909) | arXiv 2025
When Do Program-of-Thought Works for Reasoning? (https://ojs.aaai.org/index.php/AAAI/article/view/29721) | AAAI 2024
PAL: Program-aided Language Models (https://proceedings.mlr.press/v202/gao23f.html) | ICML 2023
Show Your Work: Scratchpads for Intermediate Computation with Language Models (https://arxiv.org/abs/2112.00114) | arXiv 2021
Reasoning Like Program Executors (https://aclanthology.org/2022.emnlp-main.48/) | EMNLP 2022
Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments (https://aclanthology.org/2025.findings-acl.817/) | ACL 2025 Findings
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (https://openreview.net/forum?id=_VjQlMeSB_J) | NeurIPS 2022

Hybrid Symbolic–Neural Execution

Paper | Venue
Self-Verifying Reflection Helps Transformers with CoT Reasoning (https://neurips.cc/virtual/2025/poster/119948) | NeurIPS 2025
SSR: Socratic Self-Refine for Large Language Model Reasoning (https://arxiv.org/abs/2511.10621) | arXiv 2025
CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance (https://arxiv.org/abs/2502.04350) | ICML 2025
Graph of Thoughts: Solving Elaborate Problems with Large Language Models (https://ojs.aaai.org/index.php/AAAI/article/view/29720) | AAAI 2024
Code-as-Symbolic-Planner: Foundation Model-Based Robot Planning via Symbolic Code Generation (https://arxiv.org/abs/2503.01700) | IROS 2025

Iterative Code-Grounded Reasoning

Paper | Venue
NExT: Teaching Large Language Models to Reason about Code Execution (https://arxiv.org/abs/2404.14662) | ICML 2024
What I cannot execute, I do not understand: Training and Evaluating LLMs on Program Execution Traces (https://arxiv.org/abs/2503.05703) | arXiv 2025
Reasoning Through Execution: Unifying Process and Outcome Rewards for Code Generation (https://arxiv.org/abs/2412.15118) | ICML 2025
CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment (https://arxiv.org/abs/2510.18471) | arXiv 2025
RLTF: Reinforcement Learning from Unit Test Feedback (https://arxiv.org/abs/2307.04349) | TMLR 2023
RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning (https://arxiv.org/abs/2410.02089) | ICML 2025
Execution guided line-by-line code generation (https://openreview.net/forum?id=ySFDPoiANu) | NeurIPS 2025
R1-Code-Interpreter: LLMs Reason with Code via Supervised and Multi-stage Reinforcement Learning (https://arxiv.org/abs/2505.21668) | arXiv 2025
CYCLE: Learning to Self-Refine the Code Generation (https://dl.acm.org/doi/full/10.1145/3649825) | OOPSLA 2024
StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback (https://aclanthology.org/2024.acl-long.251/) | ACL 2024
CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning (https://openreview.net/forum?id=WaGvb7OzySA) | NeurIPS 2022
CodePRM: Execution Feedback-enhanced Process Reward Model for Code Generation (https://aclanthology.org/2025.findings-acl.428/) | ACL 2025 Findings
SatLM: Satisfiability-Aided Language Models Using Declarative Prompting (https://openreview.net/forum?id=8tt9KxyV2s) | NeurIPS 2023
Self-Edit: Fault-Aware Code Editor for Code Generation (https://aclanthology.org/2023.acl-long.45/) | ACL 2023

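🤖 Code for Acting

Generated programs serve as policies, tool calls, behavior trees, or reusable skills in embodied, GUI, software, and tool-use environments.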
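As a rough illustration of a generated program acting as a policy, the sketch below exposes a few skill primitives to model-written code and records the resulting actions. The skill names (`move_to`, `grasp`, `release`) and the `policy_source` string are hypothetical; a real harness would bind the primitives to a robot, GUI, or tool API.

```python
from typing import Callable

log: list[str] = []  # action trace produced by running the policy


# Hypothetical skill primitives that only record what they were asked to do.
def move_to(target: str) -> None:
    log.append(f"move_to({target})")


def grasp(obj: str) -> None:
    log.append(f"grasp({obj})")


def release() -> None:
    log.append("release()")


SKILLS: dict[str, Callable] = {"move_to": move_to, "grasp": grasp, "release": release}

# Hypothetical model-written policy: it composes primitives into a reusable
# skill (`put`) and then applies that skill to several objects.
policy_source = '''
def put(obj, place):
    move_to(obj)
    grasp(obj)
    move_to(place)
    release()

for item in ["cup", "spoon"]:
    put(item, "tray")
'''

# The policy can only call what the harness exposes; builtins are emptied for
# brevity, which is not a real sandbox.
exec(policy_source, {"__builtins__": {}, **SKILLS})

print(len(log))  # 8
```

The action trace in `log` is what makes the behavior auditable: each high-level call to `put` expands into four primitive actions.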

Grounded Skill Selection

Paper | Venue
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (https://arxiv.org/abs/2204.01691) | CoRL 2022
Robots That Ask for Help: Uncertainty Alignment for Large Language Model Planners (https://arxiv.org/abs/2307.01928) | CoRL 2023
Bootstrap Your Own Skills: Learning to Solve New Tasks with Large Language Model Guidance (https://arxiv.org/abs/2310.10021) | CoRL 2023
SkillVLA: Tackling Combinatorial Diversity in Dual-Arm Manipulation via Skill Reuse (https://arxiv.org/abs/2603.03836) | arXiv 2026
Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition (https://proceedings.mlr.press/v229/ha23a.html) | CoRL 2023
Lifelong Robot Library Learning: Bootstrapping Composable and Generalizable Skills for Embodied Control with Language Models (https://ieeexplore.ieee.org/document/10611448/) | ICRA 2024

Programmatic Policy Generation

Paper | Venue
RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis (https://arxiv.org/abs/2402.16117) | ICML 2024
CP-Agent: Agentic Constraint Programming (https://arxiv.org/abs/2508.07468) | arXiv 2025
LLM-Driven Corrective Robot Operation Code Generation with Static Text-Based Simulation (https://arxiv.org/abs/2512.02002) | ICRA 2026
NormCode: A Semi-Formal Language for Auditable AI Planning (https://arxiv.org/abs/2512.10563) | arXiv 2025
ALRM: Agentic LLM for Robotic Manipulation (https://arxiv.org/abs/2601.19510) | arXiv 2026
RACAS: Controlling Diverse Robots With a Single Agentic System (https://arxiv.org/abs/2603.05621) | arXiv 2026
ReAct: Synergizing Reasoning and Acting in Language Models (https://openreview.net/forum?id=WE_vluYUL-X) | ICLR 2023
GenSwarm: Scalable Multi-Robot Code-Policy Generation and Deployment via Language Models (https://www.nature.com/articles/s44182-025-00065-w) | npj Robotics 2026
Code as Policies: Language Model Programs for Embodied Control (https://ieeexplore.ieee.org/document/10160591/) | ICRA 2023
Robotic Programmer: Video Instructed Policy Code Generation for Robotic Manipulation (https://arxiv.org/abs/2501.04268) | arXiv 2025
Code-BT: A Code-Driven Approach to Behavior Tree Generation for Robot Tasks Planning with Large Language Models (https://www.ijcai.org/proceedings/2025/980) | IJCAI 2025

Lifelong Code-Based Agents

Paper | Venue
Growing with Your Embodied Agent: A Human-in-the-Loop Lifelong Code Generation Framework for Long-Horizon Manipulation Skills (https://arxiv.org/abs/2509.18597) | arXiv 2025
ViReSkill: Vision-Grounded Replanning with Skill Memory for LLM-Based Planning in Lifelong Robot Learning (https://arxiv.org/abs/2509.24219) | arXiv 2025
UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience (https://arxiv.org/abs/2603.24533) | arXiv 2026
Voyager: An Open-Ended Embodied Agent with Large Language Models (https://openreview.net/forum?id=ehfRiF0R3a) | TMLR 2023
Lifelong Language-Conditioned Robotic Manipulation Learning (https://arxiv.org/abs/2603.05160) | arXiv 2026

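🌍 Code for Environment Modeling

Program state, repositories, traces, simulators, and tests represent the state, dynamics, and feedback signals of agent interaction.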
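One way to read this is that the environment itself can be written as code: a state type, a transition function, a recorded trace, and a test that acts as the feedback signal. The tiny grid world below is purely illustrative; `GridState`, `step`, `rollout`, and `goal_reached` are hypothetical names, not taken from any listed paper.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class GridState:
    """Inspectable environment state: agent position and whether it holds the key."""
    x: int
    y: int
    has_key: bool = False


KEY_AT = (1, 0)  # assumed key location for this toy world
MOVES = {"right": (1, 0), "left": (-1, 0), "up": (0, 1), "down": (0, -1)}


def step(state: GridState, action: str) -> GridState:
    """Dynamics as code: move the agent and pick up the key when standing on it."""
    dx, dy = MOVES[action]
    nx, ny = state.x + dx, state.y + dy
    return replace(state, x=nx, y=ny, has_key=state.has_key or (nx, ny) == KEY_AT)


def rollout(state: GridState, actions: list[str]) -> list[GridState]:
    """Execution trace: every intermediate state, not just the final one."""
    trace = [state]
    for action in actions:
        state = step(state, action)
        trace.append(state)
    return trace


def goal_reached(state: GridState) -> bool:
    """Feedback signal expressed as a test over the state."""
    return state.has_key and (state.x, state.y) == (1, 1)


trace = rollout(GridState(0, 0), ["right", "up"])
print(goal_reached(trace[-1]))  # True
```

Since the state is an immutable value, each entry in `trace` can be compared, logged, or replayed independently.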

Structured World Representations

Paper | Venue
From Programs to Poses: Factored Real-World Scene Generation via Learned Program Libraries (https://openreview.net/forum?id=Ew8bJkSt3g) | NeurIPS 2025
PoE-World: Compositional World Modeling with Products of Programmatic Experts (https://openreview.net/forum?id=obwRcksFZw) | NeurIPS 2025
Code2World: A GUI World Model via Renderable Code Generation (https://arxiv.org/abs/2602.09856) | arXiv 2026
Code2Worlds: Empowering Coding LLMs for 4D World Generation (https://arxiv.org/abs/2602.11757) | arXiv 2026
ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided Code-Vision Representation (https://aclanthology.org/2023.emnlp-main.824/) | EMNLP 2023

Execution-Trace World Modeling

Paper | Venue
SemCoder: Training Code Language Models with Comprehensive Semantics Reasoning (https://arxiv.org/abs/2406.01006) | NeurIPS 2024
CWM: An Open-Weights LLM for Research on Code Generation with World Models (https://arxiv.org/abs/2510.02387) | arXiv 2025
Reinforcement World Model Learning for LLM-based Agents (https://arxiv.org/abs/2602.05842) | arXiv 2026
Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning (https://arxiv.org/abs/2602.10090) | arXiv 2026
Aligning Agentic World Models via Knowledgeable Experience Learning (https://arxiv.org/abs/2601.13247) | arXiv 2026
WorldCoder, a Model-Based LLM Agent: Building World Models by Writing Code and Interacting with the Environment (https://proceedings.neurips.cc/paper_files/paper/2024/file/820c61a0cd419163ccbd2c33b268816e-Paper-Conference.pdf) | NeurIPS 2024

Code-Grounded Evaluation Environments

Paper | Venue
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution (https://arxiv.org/abs/2401.03065) | ICML 2024
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code (https://openreview.net/forum?id=chfJJYC3iL) | ICLR 2025
SWE-bench: Can Language Models Resolve Real-world Github Issues? (https://arxiv.org/abs/2310.06770) | ICLR 2024
AgentBench: Evaluating LLMs as Agents (https://arxiv.org/abs/2308.03688) | ICLR 2024
CoRe: Benchmarking LLMs’ Code Reasoning Capabilities through Static Analysis Tasks (https://neurips.cc/virtual/2025/poster/121601) | NeurIPS 2025
Geogrambench: Benchmarking the geometric program reasoning in modern LLMs (https://arxiv.org/abs/2505.17653) | arXiv 2025
CodeGlance: Understanding Code Reasoning Challenges in LLMs through Multi-Dimensional Feature Analysis (https://arxiv.org/abs/2602.13962) | arXiv 2026
Endless Terminals: Scaling RL Environments for Terminal Agents (https://arxiv.org/abs/2601.16443) | arXiv 2026
Reflexion: Language Agents with Verbal Reinforcement Learning (https://openreview.net/forum?id=vAElhFcKW6) | NeurIPS 2023
CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution (https://aclanthology.org/2025.acl-long.1158/) | ACL 2025
InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback (https://proceedings.neurips.cc/paper_files/paper/2023/hash/4b175d846fb008d540d233c188379ff9-Abstract-Datasets_and_Benchmarks.html) | NeurIPS 2023

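🛠️ Harness Mechanisms

Once code is placed in the agent loop, the harness must decide what to execute next, preserve useful state, expose the right tools, and turn failures into corrective actions.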
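The four responsibilities named above can be sketched as a small loop. This is a simplified illustration under stated assumptions, not the survey's design: `harness_loop` is a hypothetical helper, and `scripted_propose` is a stand-in for a model that first proposes buggy code and then repairs it after seeing the recorded error.

```python
def harness_loop(propose, tools: dict, max_steps: int = 5) -> dict:
    """Run proposed code step by step, keeping state and recording failures."""
    state: dict = {"history": []}          # preserved across steps
    for _ in range(max_steps):
        source = propose(state)            # decide what to execute next
        if source is None:
            break
        scope = {**tools, "state": state}  # expose only the chosen tools
        try:
            exec(source, scope)
            error = None
        except Exception as exc:           # failure becomes structured feedback
            error = f"{type(exc).__name__}: {exc}"
        state["history"].append({"source": source, "error": error})
    return state


def scripted_propose(state: dict):
    """Stand-in for a model: buggy first attempt, corrected second attempt."""
    history = state["history"]
    if not history:
        return "state['total'] = add(2, '3')"       # mixes int and str
    if history[-1]["error"]:
        return "state['total'] = add(2, int('3'))"  # corrective action
    return None                                     # nothing left to do


result = harness_loop(scripted_propose, tools={"add": lambda a, b: a + b})
print(result["total"])  # 5
```

The first step fails with a `TypeError`, the error text is stored in `history`, and the next proposal uses it to repair the call, which is the "failure into corrective action" path in miniature.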

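Figure: Harness mechanisms

🗺️ Planning for Code Agents

Planning is the harness's control layer: it structures how the agent externalizes intent into executable steps, schedules its interactions with code artifacts and tools, and regulates the trajectory.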
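As one possible reading of this, a plan can be held as an explicit dependency graph that the harness schedules and revises. The sketch below uses Python's standard `graphlib.TopologicalSorter`; the step names and the `insert_repair` helper are hypothetical and only illustrate the idea.

```python
from graphlib import TopologicalSorter

# Hypothetical plan for fixing a bug: each step maps to the steps it depends on.
plan = {
    "read_issue": set(),
    "locate_code": {"read_issue"},
    "write_patch": {"locate_code"},
    "write_test": {"read_issue"},
    "run_tests": {"write_patch", "write_test"},
}


def schedule(plan: dict) -> list:
    """Turn the plan into an executable order that respects dependencies."""
    return list(TopologicalSorter(plan).static_order())


def insert_repair(plan: dict, failed: str, repair: str) -> dict:
    """Adjust the trajectory: add a repair step that must run before `failed`."""
    revised = {step: set(deps) for step, deps in plan.items()}
    revised[repair] = set(revised[failed])        # repair needs the same inputs
    revised[failed] = revised[failed] | {repair}  # failed step now waits for it
    return revised


order = schedule(plan)
revised_order = schedule(insert_repair(plan, "run_tests", "fix_patch"))

print(order[0], order[-1])  # read_issue run_tests
```

Several orders are valid for the middle steps, so only the first and last positions are shown; the point is that the plan is data the harness can inspect and rewrite.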

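Related Articles

Code as Agent Harness

Hugging Face Daily Papers

This survey presents a unified view of code as the operational substrate for agent reasoning and execution in agentic systems, organizing the discussion around three layers: harness interface, mechanisms, and scaling.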

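@FakeMaidenMaker: awesome-harness-engineering: the knowledge collected in this project is worth far more than this number suggests, with frontline engineering practice from OpenAI, Anthropic, Microsoft, and Meta all in one place. GitHub: https://github.com/ai-boos…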

X AI KOLs Timeline

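awesome-harness-engineering is a curated list of resources from OpenAI, Anthropic, Microsoft, Meta, and others on AI agent harness engineering (context management, tool design, verification loops, memory systems, and more), intended to help developers build reliable agent harnesses.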

@AlphaSignalAI: https://x.com/AlphaSignalAI/status/2057153343081111582

X AI KOLs Timeline

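A 100-page survey released jointly by UIUC, Meta, and Stanford University introduces three harness layers for AI agents (interface, mechanisms, scaling), argues that most agent failures stem from harness problems rather than reasoning deficits, and provides a taxonomy for auditing agent stacks.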