@FinanceYF5: ENPIRE 已能独立完成扎束线带、整理细针、安装 GPU 等高精度操作,并展现出“物理扩展”现象:多机器人并行探索,进步速度明显更快。 NVIDIA GEAR 实验室的一部分如今已能通宵自我改进,人类早上只需查看报告。项目也将开源。 项…
摘要
NVIDIA GEAR lab introduces ENPIRE, a framework for autonomous real-world robot policy self-improvement that achieves 99% success on dexterous manipulation tasks like GPU insertion and zip-tying, with multi-robot parallel learning and open-source release.
查看缓存全文
缓存时间: 2026/06/18 02:02
ENPIRE 已能独立完成扎束线带、整理细针、安装 GPU 等高精度操作,并展现出“物理扩展”现象:多机器人并行探索,进步速度明显更快。 NVIDIA GEAR 实验室的一部分如今已能通宵自我改进,人类早上只需查看报告。项目也将开源。 项目网站:https://research.nvidia.com/labs/gear/enpire/…
Agentic Robot Policy Self-Improvement in the Real World
Source: https://research.nvidia.com/labs/gear/enpire/
ENPIRE: Agentic Robot PolicySelf-Improvement in the Real World
,Jia Xie2†,Tonghe Zhang2†,Haotian Lin2†,Letian “Max” Fu3,Haoru Xue3,Jalen Lu2, Yi Yang2,Cunxi Dai2,Zi Wang1,Jimmy Wu1,Guanzhi Wang1,S. Shankar Sastry3,Ken Goldberg3, Linxi “Jim” Fan1‡,Yuke Zhu1‡,Guanya Shi2‡
1NVIDIA2CMU3UC Berkeley†Equal contribution‡Equal advising



Abstract
Achieving dexterous robotic manipulation in the real world relies heavily on human supervision and algorithmic engineering, which is a central bottleneck in the pursuit of general physical intelligence. Although emerging coding agents can generate code to automate algorithm search, their successes remain largely confined to digital environments. We conjecture that the missing abstraction to automate robotics research is a repeatable feedback loop for real-world policy improvement: reset the scene, execute a policy, verify the outcome, and refine the next iteration.
To bridge this gap, we introduce ENPIRE, a harness framework for coding agents that instantiates this physical feedback routine with four core modules: an Environment module (EN) for automatic reset and verification, a Policy Improvement module (PI) that launches policy refinement, a Rollout module (R) to evaluate policies with single or multiple physical robots operating in parallel, and an Evolution module (E) in which coding agents analyze logs, consult literature, improve training infrastructure and algorithm code to address failure modes.
This closed-loop system transforms real-world robot learning into a controllable optimization procedure that agents can manage, thus minimizing human effort while allowing fair ablations across training recipes and agent variants. Powered by ENPIRE, frontier coding agents can autonomously develop a policy to achieve a 99% success rate on challenging, dexterous manipulation tasks in the real world, such as PushT, organizing pins into a pin box, and using a cutter to cut a zip tie.
Coding agents can improve policies with various PI regimes, such as heuristic learning, tool calling, behavior cloning, offline or online RL. Moreover, ENPIRE can be significantly accelerated on a robot fleet, and we propose two metrics, namely, Mean Robot Utilization (MRU) and Mean Token Utilization (MTU) to measure the efficiency of multiagent physical autoresearch. We also include simulation results in RoboCasa. Our findings suggest a practical and scalable path toward autonomously advancing robotics in the real world.
Learned Manipulation Policy
Policies trained with ENPIRE reach a 99% pass@8 success rate across the showcased manipulation tasks.
Push T
Pin Insertion
GPU Insertion
Tie Ziptie
Cut Ziptie
ENPIRE runs fully autonomously on real robots. Working only through the automated reset and verification interface, a team of coding agents proposes algorithmic hypotheses (heuristic learning, behavior cloning, offline and online RL), tests them against the real-world success rate, and keeps the changes that move it. The idea tree below traces that search as a hypothesis git-tree — one branch per agent, one node per idea tried — plotted on the same wall-clock-time axis as the success-rate curve, so you can see the ideas that moved the curve upward.
I1I2I3I6I7I13I22I24I34I8I36I62I69I70I72I74I75I77I78I79I80I81I82I83I84I85I86I4I9I12I26I37I16I41I44I48I52I55I57I59I61I64I65I67I68I71I73I76I18I29I35I42I45I47I49I50I53I54I56I58I60I63I66I10I23I27I5I11I15I17I19I21I28I33I38I39I40I46I51I14I20I25I30I31I32I43050%100%team-avg best success rateI16 Online RL mix Demo+3.8 ppI37 BC regularization+10.8 ppI56 Tweak BC term weight+0.4 ppI66 Tune batch size 1024→512+0.9 ppI76 Compensate controller+1.3 pp01 h2 h3 hresearch wall-clock time →each dot = an ideaclick any dot to read the ideagreen ring = idea that raised the team-avg scoregreen curve = cross-agent inspiration
**Figure1:**Each coding agent explores its own branch of ideas, one lane per branch. Every dot is an idea it tried; a green ring marks an idea that raised the team’s average success rate, and green curves trace cross-agent inspiration. The lower panel tracks the team’s average success rate climbing over research wall-clock time.
ENPIRE System
Construct Environment
Policy Improvement
Action
Obs
Reward
3
# TODO: auto task reset
4
pick_and_place(obj, target)
8
defget_reward(self, obs, act):
10
mask = sam3(obs[‘left’])
11
pos = boundlsdf(obs, mask)
14
defget_observation(self):
ENPIREEnvironment
01Literature review
PLDRL-TokenCaP-X
02Propose algorithm variant
HeuristicsOff2On RLCode-as-policyBC
03Optimize Infra
Data SamplerParam Sweep
04Summarize experiment result
050%100%team-avg best success rateOnline RL mix Demo+3.8 ppBC regularization+10.8 ppRe-evaluate+12.5 ppTweak BC term weight+0.4 ppTune batch size 1024→512+0.9 ppCompensate controller+1.3 pp01h2h3hresearch wall-clock time →Hillclimb Timeline

GPU insertion
Pin insertion
Push-T
Zip tie cutting Real-world tasks
BUILDtask objective& feedbackRUNrolloutslogs Figure2: ENPIRE system overview, integrated as a native site component with shared typography and animation.## From Robot Hardware to an Agent-Operable Environment
Before an agent can improve a robot policy, the task must become self-resetting and self-verifying. Two capabilities make this possible: automatic evaluation, which scores the outcome of each trial without human judgment, and automatic reset, which returns the scene to a fresh initial state for the next trial.
Auto Evaluation
We use an autoresearch-derived reward function to automatically score the outcome of zip-tie insertion: a detector draws bounding boxes around the zip-tie head and strap, a segmentation model resolves the same parts into masks over the raw view, and each camera view independently judges whether the zip-tie strap passes through the head above a fixed length threshold. The per-camera verdicts are then fused into the final binary reward.
Bounding boxesdetector boxes drawn on the raw top and right views Segmentationpart masks overlaid on the raw top and right viewstop camera
No
right camera
No
Per-camera verdictIs tail covered by head?REWARD=0
Final rewardboth cameras fused into one binary reward
Auto Reset
The reset panels below show the physical loop that makes repeated experiments possible: select a randomized initial state, run the reset behavior, and verify that the trial is ready for the next policy.
Case 1: Push T
Case 2: Pin Insertion
Case 3: Tie Zip-tie
Case 4: GPU Insertion
- Automatic reset returns each task to a known randomized initial state without manual intervention.
- Automatic verification records whether the reset succeeded and exposes representative frames for inspection.
Agents Improve Policies From Physical Feedback
Once the environment is operable, agents edit policy code, run trials, inspect failures, and decide what to change next. The Push-T panel visualizes actual rollout traces from multiple code agents under the same six initial states so the behavior is inspectable, not just summarized by a success rate.
Codex Push-T rollout0% coverage43steps
Claude Code Push-T rollout0% coverage73steps
Kimi Code Push-T rollout0% coverage66steps
In-distribution random initialization
Out-of-distribution random initialization
Evaluate Coding Agent
We evaluate the physical autoresearch capability of three coding agents: Codex with GPT-5.5, Claude Code with Opus 4.7, and Kimi Code with Kimi K2.6. Instead of asking only whether a final policy succeeds, AutoEnvBench tracks agent-driven research progress over wall-clock time across Push-T and Pin Insertion.
0%
Push-T (Heuristic Learning)
Push-T model comparison
024680%25%50%75%100%Normalized ScoreResearch Time (h)0h
Pin Insertion (Gradient-based Learning)
Pin Insertion model comparison
0123450%75%100%Success RateResearch Time (h)0h
Agent:
Figure3: Coding-agent comparison on AutoEnvBench, measuring agent-driven research progress across Push-T and Pin Insertion.## Scaling Autoresearch on Robot Fleets
Scaling the number of agents changes both research progress and hardware pressure. The scaling-law plots compare one-, four-, and eight-agent teams on Push-T and Pin Insertion, while the resource utilization figure shows robot utilization, GPU utilization, token throughput, and the time required to reach task success.
0%
Push-T Scaling
Push-T team scaling
0123450%25%50%75%100%Normalized ScoreResearch Time (h)0h
Pin Insertion Scaling
Pin Insertion team scaling
00.511.50%25%50%75%100%Success RateResearch Time (h)0.0h
Team size:
Figure4: Scaling-law view for AutoEnvBench, comparing one-, four-, and eight-agent research teams across Push-T and Pin Insertion. 0%
Pin Insertionautoresearch with robot fleet Push-T stage-wise research progressGrouped bar chart showing minutes to first move, range satisfaction, orientation satisfaction, and first success for Codex vision ablations.0153045607590105120135Time-to- (min)25385555Codexw/ native vision13207899Codexw/ VLM as vision tool13137272Codexw/ GoFE vision
Figure5: Push-T stage-wise progress under the fleet scaling setup. Bars report time to successive behavioral milestones; whiskers mark first-success variability.BarsRobot / tokensGPU / observed throughputReferenceLinear projection ±10%Time to success
1 agent4 agents8 agents0255075100(a) Mean Resource UtilizationUtilization (%)
1 agent4 agents8 agents050k100k150k(b) Mean Token UtilizationMTU
1 agent4 agents8 agents051015012345(c) Token to SuccessToken to Success (M)Time to success (h)
Figure6: Agent resource utilization across 1, 4, and 8 agents. (a) Mean robot and GPU utilization. (b) Mean token throughput with linear scaling projection. (c) Tokens and time required to reach task success.## Evaluation in Simulation
We also evaluate ENPIRE in simulation to separate agent-driven research behavior from real-world hardware throughput. Simulation tasks let agents run denser ablations, compare policy-improvement regimes under controlled resets, and test whether recipes discovered in the physical loop transfer to broader manipulation settings.
Coffee Setup MugSeed 1·right camera
Seed 1 rollout progress
Coffee Setup MugSeed 2·right camera
Seed 2 rollout progress
Limitations & Future Directions
Robot and compute resources are underutilized
Coding agents do not fully utilize robot resources when they are reading logs, writing code, debugging, or waiting for the language-model backbone. As the number of robots scales, MRU decreases while GPU active utilization increases. Compared to a single-robot setup, agent teams spend more time summarizing peer branches and less time operating the robot, and coding agents may fail to launch enough parallel training sessions to exhaust GPU resources.
Scaling robot fleet causes higher token consumption
Scaling the robot fleet drives higher token consumption: as more agents read logs, summarize peer branches, and coordinate, the total token budget required to reach a successful policy grows with fleet size. Larger fleets can reach success sooner, but the additional speedup comes at the cost of higher token consumption.
Acknowledgements
We are grateful to many colleagues whose help made this project possible. We thank Jason Liu, Tony Tao, Tairan He, Alex Lin, Jim Yang, Paul Zhou, and Abhi Maddukuri for insightful discussions and feedback; Yide Shentu, Bike Zhang, Angchen Xie, Dvij Kalaria, and Yuqi Xie for their support with the experiments; Lion Park, Matin Furutan, Jeremy Chimienti, Dennis Da, and Tri Cao for fleet operation; and Tri Cao for the demo shots. We also thank the NVIDIA GEAR Team and the CMU LeCAR Lab for their continuous support.
相似文章
Nvidia的自主机器人研究(6分钟阅读)
ENPIRE是一个框架,使编码代理能够通过真实世界的反馈循环自主改进机器人操作策略,在插针和剪扎带等灵巧任务上实现了99%的成功率。
@DrJimFan:今天,我们首次在物理世界中启用自动研究!介绍ENPIRE:我们给8个Codex智能体一个机器人舰队……
NVIDIA GEAR实验室推出了ENPIRE,这是一个使用8个Codex智能体自主控制机器人舰队执行物理任务(如扎扎带、安装GPU)的系统,展示了自我改进的机器人研究以及一种新的'physical scaling'现象。
@FinanceYF5: Jim Fan 宣布,团队首次将 AutoResearch 带入物理世界,推出 ENPIRE。 他们给 8 个 Codex 智能体配备机器人、GPU 和充足 token,让它们在真实硬件上自主学习、试错、协作和完成任务。
Jim Fan 宣布团队推出 ENPIRE,首次将 AutoResearch 带入物理世界,为 8 个 Codex 智能体配备机器人、GPU 和 token,使其在真实硬件上自主学习和协作完成任务。
ENPIRE: 现实世界中自主机器人策略自我改进
ENPIRE是一个框架,通过环境反馈、策略优化和进化代码优化的闭环系统,使机器人能够在现实世界中自主实现策略自我改进,在灵巧操作任务上达到99%的成功率。
NVIDIA的AI智能体教会机器人无需人类帮助将GPU安装到主板上
NVIDIA与卡内基梅隆大学(CMU)和加州大学伯克利分校(UC Berkeley)共同开发的ENPIRE框架,利用AI编码智能体自主训练机器人执行高精度物理任务(如GPU安装),通过闭环反馈和真实硬件测试实现了99%的成功率。