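@Xudong07452910: When AI starts researching AI on its own, what really gets open-sourced may not be code but a research protocol. Deli AutoResearch SKILL, open-sourced by DeepSeek researcher Deli Chen (陈德里), is well worth a look: it is a set of operating rules for letting AI do research over long periods. It is not a complex codebase but a…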

X AI KOLs Timeline | Tools

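Summary

DeepSeek researcher Deli Chen has open-sourced Deli AutoResearch SKILL, a SKILL.md protocol file that defines operating rules for long-running autonomous AI research, including state persistence, stall detection, and a heartbeat mechanism. It aims to turn autonomous research from a vision into an engineered loop that can run sustainably.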


Cached: 2026/06/20 18:21

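When AI starts researching AI on its own, what really gets open-sourced may not be code but a research protocol.

Deli AutoResearch SKILL, open-sourced by DeepSeek researcher Deli Chen, is well worth a look: it is a set of operating rules for letting AI do research over long periods.

It is not a complex codebase but a single SKILL.md protocol file, aimed squarely at the problems that most often derail long-running agents: going in circles, finishing a chunk of work and then stopping to wait for a human, and the loop quietly dying as soon as the context is compacted.

Its approach is very much an engineering one: write state to files instead of relying on chat context; force every round to record progress and direction; switch to a structurally different direction when the work stalls; and use a heartbeat mechanism to check that the agent is actually alive.

The most interesting part is that it breaks "autonomous research" down from a hand-wavy vision into concrete operating rules.

The agent no longer just writes papers, runs experiments, tunes GPUs, and writes summaries; it is placed inside a research loop that can iterate continuously, check itself, and guard against dead loops.

Of course, it is still far from fully replacing researchers. The page is candid about this: the scores come from simulated review, citation hallucinations still occur, and even the longest continuous run kept human directional input.

But the real takeaway is this: the key to future research agents may not be just a stronger model, but whether there is a protocol, a state system, and an error-correction mechanism that let it work stably over long periods.

We used to ask whether AI can do research.

The better question now is:

Who will be first to turn AI research into a production line that runs sustainably? https://victorchen96.github.io/auto_research/framework.html…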


Deli_AutoResearch — Autonomous Research Framework

Source: https://victorchen96.github.io/auto_research/framework.html

The skill itself is one self-contained Markdown document.

It depends on no external resource or private infrastructure — a single SKILL.md fully defines the whole protocol: motivation, behavioral constraints, architecture, state files, stall detection, heartbeat watchdog, scheduling patterns, and engineering constraints. Below is a structured reading guide; the complete SKILL.md is appended at the bottom, copyable in one click.


01 Motivation: Three Failure Modes

Long-running code agents exhibit three recurring failure modes whose common cause is missing engineering scaffolding, not model capability. Every mechanism in the framework targets them.

  1. Cognitive Loop: Successive iterations try similar directions with diminishing returns, unable to escape a local optimum on their own.
  2. Stalling: The agent finishes a chunk, summarizes, and waits for feedback. Externally the session looks alive and polling runs, but work has stopped. This is more common than crashes.
  3. Runtime Fragility: Context compaction silently breaks the loop; closing a session takes down the timers parasitic on it. Failures go unnoticed by default.

02 Behavioral Constraints

Hard rules of the framework, each induced from a real failure.

  i. Zero interaction: No prompting the user during a run: no Plan Mode, no question tool, no ending on a question. Continue until stopped. Resolve ambiguity yourself and log the reasoning (level=decision).
  ii. Ready means execute: The most common hidden violation is finishing all prep and then asking "should I submit?". Prep exists to be executed; submitting, resubmitting, fixing, and starting monitors are routine and need no confirmation.
  iii. Callback means report-alive: After context compaction the loop silently dies. The first action of every callback updates its own last_seen, then checks liveness; on failure it restarts immediately and logs it.
  iv. Persist state to files: All progress is written to state/ files, not conversation memory. Each iteration starts a fresh session, injecting only curated state; never resume.
  v. Guardian / worker separation: A heartbeat patrol may only do three things to tasks that aren't its own: liveness-check, restart, nudge. It does not read their data, modify their state, or report for them. Basis: a patrol once overstepped into another task's business, causing context pollution, reporting drift, and concurrent-write risk.
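The "callback means report-alive" rule can be illustrated with a small Python sketch. This is not code from the framework, which ships none: the shared JSON file holding `last_seen` values and the names `report_alive` and `on_callback` are assumptions made for illustration. It shows only the first step of the rule (updating `last_seen` before doing anything else); the liveness check and restart are left out.

```python
import json
import time
from pathlib import Path
from typing import Callable, Optional


def report_alive(registry: Path, loop_name: str, now: Optional[float] = None) -> dict:
    """Write this loop's last_seen timestamp to a shared JSON file (assumed layout)."""
    now = time.time() if now is None else now
    seen = json.loads(registry.read_text()) if registry.exists() else {}
    seen[loop_name] = now
    registry.write_text(json.dumps(seen, indent=2))
    return seen


def on_callback(registry: Path, loop_name: str, do_work: Callable[[], None]) -> None:
    """Run one callback of a business loop (hypothetical helper)."""
    # Report alive before anything else, so that a loop which silently died
    # shows up to the watchdog as a stale last_seen value.
    report_alive(registry, loop_name)
    do_work()
```

Putting the timestamp update first means a watchdog only has to compare timestamps; it never needs to read the task's own data.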

03 Architecture

The orchestrator monitors state, detects stalls, and injects new directions; each task runs in its own fresh session. Three core decisions: separate execution from evaluation, prefer fresh sessions over resume, and enforce direction diversity.

┌── Orchestrator (current session / durable cron) ─────────┐
│monitor state files → detect stalls → inject new direction│
└──────┬──────────────────┬──────────────────┬────────────┘
       ▼                  ▼                  ▼
   [Task A]           [Task B]           [Task C]   ← each its own fresh session

04 State File System

Each task keeps its own state and log directories. Three process types write separate log streams, so debugging never needs cross-file correlation.

{task}/state/
├── task_spec.md           # goal / milestones / success criteria
├── progress.json          # {iteration, status, stale_count, ...}
├── findings.jsonl         # accumulated findings (append-only)
├── directions_tried.json  # directions tried (basis for diversity)
└── iteration_log.jsonl    # per-iteration summary

{task}/logs/
├── work.jsonl             # work agent; decisions tagged level=decision
├── orchestrator.jsonl     # orchestrator
└── heartbeat.jsonl        # heartbeat watchdog

05 Stall Detection & Pivoting

| Mechanism | Rule |
| --- | --- |
| Stall detection | An iteration with 0 new findings or a metric drop → stale_count + 1 |
| Forced pivot | stale_count ≥ 2 → change a structural constraint, not tactical parameters; ≥ 4 → flag for human attention |
| Direction diversity | A new direction must differ from every tried one; after a stall, inject perturbation (start from the opposite hypothesis, find structurally similar cross-domain cases, etc.) |
| Round cap | A single work session caps at 15 rounds or 30 minutes |

Why pivot structure, not tactics

This comes from practice: when a task stalls repeatedly within a frame, the decisive gain usually comes from correcting the environment/structural constraint itself, not from tuning strategy parameters harder inside the existing frame. Two stalls should prompt questioning the environment, not deeper digging in one direction.
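The stall-detection and forced-pivot rules amount to a small state machine, sketched below in Python. The thresholds (2 and 4) come from the guide; everything else is an assumption for illustration: the `Progress` class, the function names, the convention that a higher metric is better, and in particular the reset of `stale_count` after a productive iteration, which the guide does not specify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Progress:
    iteration: int = 0
    stale_count: int = 0


def record_iteration(progress: Progress, new_findings: int,
                     metric: float, prev_metric: Optional[float]) -> Progress:
    """Apply the stall rule: 0 new findings or a metric drop adds one to stale_count."""
    # Assumption: higher metric is better, so "drop" means metric < prev_metric.
    metric_dropped = prev_metric is not None and metric < prev_metric
    progress.iteration += 1
    if new_findings == 0 or metric_dropped:
        progress.stale_count += 1
    else:
        progress.stale_count = 0  # assumption: productive iterations reset the counter
    return progress


def next_action(progress: Progress) -> str:
    """Map stale_count to the action the guide prescribes."""
    if progress.stale_count >= 4:
        return "flag_for_human"
    if progress.stale_count >= 2:
        return "pivot_structural_constraint"
    return "continue"
```

Keeping the judgment in a function that only sees counts and metrics matches the guide's point that the working agent should not grade its own progress.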

06 Heartbeat Watchdog (3-Layer)

The business loop is itself unreliable and needs an independent guardian layer. Three mutually-checking layers: any one dying can be detected and recovered by another.

| Layer | Form | Role |
| --- | --- | --- |
| L0 | A resident shell guard depending on no session | Heartbeat timestamp stale > 2h → spin up an emergency patrol via a headless agent |
| L1 | A durable scheduled job, hourly | Check each loop's last_seen, restart timed-out loops, detect stalling and nudge |
| L2 | Business loops, each in its own session | First line of each callback updates its own last_seen |

Stall detection threshold

If progress has no update for over 2h and the last output is a question → judged stalled, launch a nudge subagent (inject the task’s task_spec and progress, instruct it to continue and update state). Three consecutive nudges with no progress → judged structurally stuck; stop nudging and reopen with a new direction. The 2h threshold is deliberately shorter than the 4h stuck-task threshold: stalling is a voluntary stop, cheap to fix, worth catching earlier.
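As a rough illustration, the stall judgment above can be written as a pure function. The 2h threshold and the three-nudge limit are from the guide; the function names, the return labels, and the heuristic of treating a trailing question mark as "the last output is a question" are assumptions, not part of the protocol.

```python
STALL_AFTER_SECONDS = 2 * 3600  # 2h stall threshold from the guide
MAX_NUDGES = 3                  # consecutive fruitless nudges before reopening


def looks_like_question(text: str) -> bool:
    # Assumption: a trailing question mark (ASCII or full-width) is a
    # good-enough signal that the agent stopped to ask something.
    return text.rstrip().endswith(("?", "?"))


def judge_task(now: float, last_progress_ts: float,
               last_output: str, nudges_without_progress: int) -> str:
    """Classify a task as not stalled, due for a nudge, or structurally stuck."""
    idle = now - last_progress_ts
    if idle <= STALL_AFTER_SECONDS or not looks_like_question(last_output):
        return "not_stalled"
    if nudges_without_progress >= MAX_NUDGES:
        return "reopen_with_new_direction"
    return "nudge"
```

A caller would pass timestamps in seconds, for example `judge_task(time.time(), progress_mtime, last_line, nudge_count)`, where those argument names are likewise hypothetical.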

07 Subagent Scheduling Patterns

| Pattern | Use | Key Idea |
| --- | --- | --- |
| A Goal-driven | Research iteration | Inject tried directions, require verifiable findings, write back to findings.jsonl |
| B Parallel exploration | Complex sub-problems | Fire multiple agents in one message: investigation, refutation, cross-domain analogy |
| C Experiment run | Long compute jobs | Start minute-level polling right after submit: auto-diagnose errors, fix, resubmit |
| D Verification | Post-iteration QA | An independent subagent audits the evidence chain of findings |

A subagent prompt should include: background, a verifiable deliverable, working directory, file/line caps, and completion criteria.

08 Engineering Constraints

Induced by the meta-learning loop from real failures; violating them empirically caused stalls or regressions.

  1. At most 5 large files per iteration; no single file over 300 lines.
  2. State is injected via files, not conversation history.
  3. Validation (test / compile / check) must run between iterations.
  4. Citation-like content is verified every 20 entries, never batched up.
  5. With multiple candidate directions, prefer adding diversity over digging one deeper.
  6. Unresolvable external-dependency failures escalate: full report + notify the owner + poll for a reply; never abandon silently.

09 Validation & Limits

The framework has carried several heterogeneous long-horizon tasks. Output of the paper-writing track (pages / citations / in-framework self-rating):

| Paper | Pages | Citations | Self-rated |
| --- | --- | --- | --- |
| Autonomous Research Agents | 59 | 228 | 8.0 |
| Continual Learning | 65 | 326 | 8.0 |
| Long-Horizon Decision-Making | 55 | 384 | 8.0 |
| Self-Play (285B RL experiment + theory hardening) | 75 | 217 | 8.6 |

Limits (honest disclosure)

  1. Scores come from in-framework multi-persona simulated review; comparable only longitudinally within the same protocol, not an external quality claim.
  2. Longest continuous run was 72 hours with 6 directional human inputs — zero operational intervention, directional intervention retained.
  3. Fabricated citations and data artifacts originate from the LLM itself; the framework makes external checking a mechanical step in the process, it does not remove the error source.
  4. Separation of duties relies on protocol constraints, not model discipline; removing the constraints brings overstepping back.

10 Full SKILL.md

The authoritative source for the guide above — one self-contained Markdown document depending on no external resource. Expand to read; copy in one click.


name: Deli_AutoResearch
description: A protocol framework for long-horizon autonomous tasks. Targets three empirically-observed failure modes (cognitive loops, stalling, runtime fragility) by prescribing state management, stall detection, and watchdog mechanisms. Validated on multiple task types including paper writing (4 ICLR-format surveys, in-framework self-rating 8.0-8.6/10).
type: Agent Framework
tags: autonomous, long-horizon, zero-interaction, anti-loop, heartbeat-watchdog, loop, multi-agent, unattended, orchestration

Deli_AutoResearch

This skill is a protocol framework for long-horizon autonomous tasks (days to weeks). It ships no executable code; instead it prescribes a set of battle-tested conventions: how state is persisted, how stalls are detected, how guardians are layered, and what constraints bind agent behavior. Implementation details are left to the adopter’s environment.

1. Motivation

Long-running code agents exhibit three recurring failure modes:

  1. Cognitive loop — successive iterations try similar directions with diminishing returns, unable to escape a local optimum on their own.
  2. Stalling — the agent finishes a chunk of work, outputs a summary, and waits for user feedback. Externally the session looks alive and polling runs, but work has effectively stopped. Run logs show this is more common than crashes.
  3. Runtime fragility — context compaction silently breaks the loop; closing a session takes down the timers parasitic on it. Failures go unnoticed by default.

The common cause of all three is missing engineering scaffolding, not insufficient model capability. Every mechanism in this framework targets the failure modes above.

2. Behavioral Constraints

  1. Zero interaction — no prompting the user during a run: no Plan Mode, no question tool, no ending on a question. Continue working until the user stops you. Resolve ambiguity yourself and write the reasoning to the log (level=decision).
  2. Ready means execute — the most common hidden violation: finishing all preparation and then asking “should I submit?”. The purpose of preparation is execution; submitting, resubmitting, fixing, and starting monitors are all routine operations needing no confirmation.
  3. Callback means report-alive — after context compaction the loop dies silently. The first action of every callback is to update its own last_seen, then check liveness; on detecting failure it restarts immediately and logs it.
  4. Persist state to files — all progress is written to state/ files, not conversation memory. Each iteration starts a fresh session, injecting only curated state; never use resume.
  5. Guardian / worker separation — a heartbeat patrol may take only three actions on tasks that are not its own: liveness-check, restart, nudge. It does not read their data, modify their state files, or report to the user on their behalf.

3. Architecture

┌── Orchestrator (current session / durable cron) ──┐
│ monitor state files → detect stalls → inject direction │
└────┬─────────────┬─────────────┬────────────┘
  [Task A]      [Task B]      [Task C]   ← each its own fresh session

Core design decisions:

  • Separate execution from evaluation — the agent doing the work does not judge its own progress; stall determination is made by the orchestration layer based on quantitative metrics.
  • Fresh session over resume — context accumulation is the primary cause of cognitive loops. Each iteration starts with fresh context; state is injected via files.
  • Enforced direction diversity — before each iteration, read the list of tried directions; a new direction must differ from all history.
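The direction-diversity decision could be enforced with a check like the following Python sketch. It assumes `directions_tried.json` holds a plain JSON list of strings, which SKILL.md does not specify, and it compares directions only after normalizing case and whitespace. A real adopter would likely need a semantic comparison; the exact-match test here is a deliberately simple stand-in, and the function names are hypothetical.

```python
import json
from pathlib import Path
from typing import List


def _normalize(direction: str) -> str:
    # Lower-case and collapse whitespace so trivial rewordings do not count as new.
    return " ".join(direction.lower().split())


def is_new_direction(candidate: str, tried: List[str]) -> bool:
    """True if the candidate differs from every direction already tried."""
    return _normalize(candidate) not in {_normalize(t) for t in tried}


def record_direction(path: Path, candidate: str) -> bool:
    """Append the candidate to the tried list; refuse it if it repeats history."""
    tried = json.loads(path.read_text()) if path.exists() else []
    if not is_new_direction(candidate, tried):
        return False
    tried.append(candidate)
    path.write_text(json.dumps(tried, indent=2))
    return True
```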

4. State Files

{task}/state/
├── task_spec.md           # goal / milestones / success criteria
├── progress.json          # {iteration, total_findings, status, stale_count}
├── findings.jsonl         # accumulated findings (append-only)
├── directions_tried.json  # directions already tried
└── iteration_log.jsonl    # per-iteration summary

{task}/logs/
├── work.jsonl             # written by work agent; decisions tagged level=decision
├── orchestrator.jsonl     # written by orchestrator
└── heartbeat.jsonl        # written by heartbeat watchdog

Log line format: {"ts":"…", "source":"…", "level":"info|warn|error|decision", "event":"…", "detail":"…"}
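A writer for this log line format might look like the Python sketch below. The five keys and the four level names are taken from the format above; the helper name `append_log` and the choice of an ISO 8601 UTC timestamp for `ts` are assumptions, since the source leaves the timestamp format unspecified.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

LEVELS = {"info", "warn", "error", "decision"}


def append_log(log_file: Path, source: str, level: str,
               event: str, detail: str = "") -> dict:
    """Append one JSON object per line (JSONL) to the given log file."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),  # assumed timestamp format
        "source": source,
        "level": level,
        "event": event,
        "detail": detail,
    }
    log_file.parent.mkdir(parents=True, exist_ok=True)
    with log_file.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

For example, a work agent resolving an ambiguity on its own could call `append_log(Path("mytask/logs/work.jsonl"), "work", "decision", "chose_direction", "picked B over A")`, where the path and event name are illustrative.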

5. Usage

# 1. Initialize the task directory, write state/task_spec.md and an initial progress.json

# 2. Start the orchestrator loop:
/loop 2h check all tasks under : (1) read progress.json;
(2) if stale_count>=3 generate a fresh direction; (3) launch a work agent
via the Agent tool (with explicit goal and completion criteria);
(4) write results back to state files. Zero interaction.

# 3. Register a durable heartbeat watchdog (survives across sessions):
hourly patrol: write a timestamp; check each loop's last_seen against interval×3,
restart if exceeded; check each task's progress for stalls over 2h, nudge if stalled.
Zero interaction.
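Step 1 of the usage above (initializing the task directory) could be scripted roughly as follows. The directory and file names follow the State Files section; the `init_task` helper, the initial `"status": "pending"` value, and the choice to pre-create the other state files are assumptions for illustration.

```python
import json
from pathlib import Path


def init_task(root: Path, task: str, spec_markdown: str) -> Path:
    """Create {task}/state and {task}/logs and write the initial state files."""
    state = root / task / "state"
    logs = root / task / "logs"
    state.mkdir(parents=True, exist_ok=True)
    logs.mkdir(parents=True, exist_ok=True)

    (state / "task_spec.md").write_text(spec_markdown, encoding="utf-8")
    progress = {
        "iteration": 0,
        "total_findings": 0,
        "status": "pending",  # assumed initial status value
        "stale_count": 0,
    }
    (state / "progress.json").write_text(json.dumps(progress, indent=2))

    # Assumption: start with an empty direction list and empty append-only logs.
    (state / "directions_tried.json").write_text("[]")
    for name in ("findings.jsonl", "iteration_log.jsonl"):
        (state / name).touch()
    return state
```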

6. Stall Detection & Pivoting

| Mechanism | Rule |
| --- | --- |
| Stall detection | an iteration with 0 new findings or a metric drop → stale_count + 1 |
| Forced pivot | stale_count >= 2 → change a structural constraint, not tactical parameters; >= 4 → flag for human attention |
| Direction diversity | a new direction must differ from every tried one; after a stall, inject a perturbation strategy |
| Round cap | a single work session caps at 15 rounds or 30 minutes |

“Pivot structure, not tactics” comes from practice: when a task stalls repeatedly within a frame, the decisive gain usually comes from correcting the environment/structural constraint itself, not from tuning strategy parameters harder inside the existing frame.
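The round cap from the table (15 rounds or 30 minutes per work session) can be sketched as a bounded loop. The two limits come from the table; the `run_work_session` name, the `step` callback that returns `False` when the work is done, and the injectable `clock` are assumptions added so the sketch is testable.

```python
import time
from typing import Callable

MAX_ROUNDS = 15
MAX_SECONDS = 30 * 60


def run_work_session(step: Callable[[int], bool],
                     clock: Callable[[], float] = time.monotonic) -> int:
    """Call step(round_index) until it returns False or a cap is hit.

    Returns the number of rounds that actually ran.
    """
    start = clock()
    rounds = 0
    # The time cap is checked before each round, so a round that is already
    # running is allowed to finish.
    while rounds < MAX_ROUNDS and clock() - start < MAX_SECONDS:
        keep_going = step(rounds)
        rounds += 1
        if not keep_going:
            break
    return rounds
```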

7. Heartbeat Watchdog

The business loop is itself unreliable and needs an independent guardian layer. Three mutually-checking layers (V3):

| Layer | Form | Depends on | Role |
| --- | --- | --- | --- |
| L0 | resident shell guard | no session | heartbeat stale > 2h → spin up an emergency patrol via a headless agent |
| L1 | durable cron, hourly | a living interactive session | check each loop’s last_seen, restart timed-out loops, detect stalling and nudge |
| L2 | business loop | each its own session | first line of each callback updates its own last_seen |

Any one layer dying can be detected and recovered by another.

Stall detection: if progress has no update for over 2 hours and the last output is a question → judged stalled, launch a nudge subagent. Three consecutive nudges with no progress → judged structurally stuck; stop nudging and reopen with a new direction. The 2h threshold is deliberately shorter than the 4h stuck-task threshold.
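The two timestamp checks behind the guardian layers can be sketched in Python. The `interval×3` timeout (from the Usage section) and the 2h heartbeat staleness limit are from the document; the function names and the dictionary-based inputs are hypothetical, and how a loop is actually restarted is left to the adopter's environment.

```python
from typing import Dict, List

EMERGENCY_AFTER_SECONDS = 2 * 3600  # heartbeat older than 2h triggers L0


def timed_out_loops(last_seen: Dict[str, float],
                    interval_s: Dict[str, float], now: float) -> List[str]:
    """L1 check: loops whose last_seen is older than three of their intervals."""
    return sorted(
        name for name, ts in last_seen.items()
        if now - ts > 3 * interval_s[name]
    )


def needs_emergency_patrol(heartbeat_ts: float, now: float) -> bool:
    """L0 check: the patrol's own heartbeat timestamp has gone stale."""
    return now - heartbeat_ts > EMERGENCY_AFTER_SECONDS
```

Both checks read only timestamps, which is consistent with the guardian/worker separation rule: the patrol never opens another task's data or state.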

8. Subagent Scheduling Patterns

| Pattern | Use | Key idea |
| --- | --- | --- |
| A Goal-driven | research iteration | inject tried directions, require verifiable findings, write back to findings.jsonl |
| B Parallel exploration | complex sub-problems | fire multiple agents in one message: investigation, refutation, cross-domain analogy |
| C Experiment run | long compute jobs | start minute-level polling right after submit: auto-diagnose errors, fix, resubmit |
| D Verification | post-iteration QA | an independent subagent audits the evidence chain of findings |

A subagent prompt should include: background, a verifiable deliverable, working directory, file/line caps, and completion criteria.
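One way to make sure no required part of a subagent prompt is forgotten is to build it from a small record type, as in this Python sketch. The five fields mirror the list above; the `SubagentBrief` class, the Markdown-style section headings, and the rendering format are assumptions, not a format the framework prescribes.

```python
from dataclasses import dataclass, fields


@dataclass
class SubagentBrief:
    background: str
    deliverable: str          # must be verifiable
    working_dir: str
    file_line_caps: str       # e.g. limits on files touched and lines per file
    completion_criteria: str

    def render(self) -> str:
        """Render the brief as a prompt; raise if any field is empty."""
        sections = []
        for f in fields(self):
            value = getattr(self, f.name).strip()
            if not value:
                raise ValueError(f"missing prompt field: {f.name}")
            title = f.name.replace("_", " ").capitalize()
            sections.append(f"## {title}\n{value}")
        return "\n\n".join(sections)
```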

9. Engineering Constraints

  1. At most 5 large files per iteration; no single file over 300 lines.
  2. State is injected via files, not conversation history.
  3. Validation (test / compile / check) must run between iterations.
  4. Citation-like content is verified every 20 entries, never batched up.
  5. With multiple candidate directions, prefer adding diversity over digging one deeper.
  6. Unresolvable external-dependency failures escalate (full report + notify the owner + poll for a reply); never abandon silently.
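Constraint 4 (verify citation-like content every 20 entries instead of batching it up) can be sketched as a buffer that triggers verification as soon as 20 unverified entries accumulate. The `CitationLog` class and the `verify` callback are hypothetical; what "verification" actually does (for example, an external lookup) is up to the adopter.

```python
from typing import Callable, List


class CitationLog:
    """Collect citation entries and verify them in checkpoints of `every` entries."""

    def __init__(self, verify: Callable[[List[str]], None], every: int = 20):
        self.verify = verify
        self.every = every
        self.unverified: List[str] = []
        self.verified: List[str] = []

    def add(self, entry: str) -> None:
        self.unverified.append(entry)
        # Verify as soon as a full checkpoint is reached, never later.
        if len(self.unverified) >= self.every:
            self.flush()

    def flush(self) -> None:
        """Verify whatever is pending, e.g. at the end of an iteration."""
        if self.unverified:
            self.verify(list(self.unverified))
            self.verified.extend(self.unverified)
            self.unverified.clear()
```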

10. Validation & Limits

The framework has carried several heterogeneous tasks: academic paper writing, long-horizon research, etc. Paper-track output:

| Paper | Pages | Citations | Self-rated |
| --- | --- | --- | --- |
| Autonomous Research Agents | 59 | 228 | 8.0/10 |
| Continual Learning | 65 | 326 | 8.0/10 |
| Long-Horizon Decision-Making | 55 | 384 | 8.0/10 |
| Self-Play (285B RL experiment + theory hardening) | 75 | 217 | 8.6/10 |

Limits:

  1. Scores come from in-framework multi-persona simulated review; comparable only longitudinally within the same protocol, not an external quality claim.
  2. The longest continuous run on record was 72 hours, with 6 directional human inputs during it — zero operational intervention, directional intervention retained.
  3. Fabricated citations and data artifacts originate from the LLM itself; the framework makes external checking a mechanical step in the process, it does not remove the error source.
  4. Separation of duties relies on protocol constraints, not model self-discipline; removing the constraints brings overstepping behavior back.
