@Xudong07452910: When AI starts researching AI autonomously, what's truly open-sourced may not be the code, but a research protocol. DeepSeek researcher Deli Chen's open-source Deli AutoResearch SKILL is worth a look — it's a set of rules for AI to conduct long-term research. It's not a complex codebase, but a...
Summary
DeepSeek researcher Deli Chen open-sourced Deli AutoResearch SKILL, a SKILL.md protocol file that defines the operating rules for long-running autonomous AI research, including state persistence, stall detection, and a heartbeat mechanism. The aim is to turn autonomous scientific research from a vision into a sustainable, closed engineering loop.
Cached at: 06/20/26, 06:21 PM
When AI starts researching AI on its own, what might truly be open-sourced is not code, but a set of research protocols.
DeepSeek researcher Deli Chen has open-sourced the Deli AutoResearch SKILL, and it is well worth a look. It essentially serves as the operating rules that let an AI conduct long-term research autonomously.
It is not a complex codebase but a single SKILL.md protocol file, specifically designed to tackle the most common derailment issues in long-running agents: getting stuck in loops, stopping mid-task to wait for humans, and silently dying after context compaction.
Its approach is highly engineering-driven: write state to files instead of relying on chat context; force logging progress and direction after each round; pivot to a structurally different direction when stuck; and use a heartbeat mechanism to monitor whether the agent is actually alive.
The most compelling part is that this framework breaks down “autonomous research” from a vague vision into concrete operational rules.
The agent no longer just writes papers, runs experiments, tunes GPUs, or summarizes results; it is placed inside a research loop that iterates continuously, checks itself, and avoids dead loops.
Of course, this is still far from fully replacing human researchers. The page honestly notes: scores come from simulated peer review, citation hallucinations still exist, and the longest continuous run still retained human directional input.
But the real insight here is: the key to future research agents may not be a more powerful model, but rather a protocol, state system, and error-correction mechanism that allows them to work stably over the long term.
We used to ask: “Can AI do research?”
Now the more interesting question is:
Who can first turn AI doing research into a sustainable production pipeline? https://victorchen96.github.io/auto_research/framework.html…
Deli_AutoResearch — Autonomous Research Framework
Source: https://victorchen96.github.io/auto_research/framework.html
The skill itself is one self-contained Markdown document. It depends on no external resource or private infrastructure: a single SKILL.md fully defines the whole protocol (motivation, behavioral constraints, architecture, state files, stall detection, heartbeat watchdog, scheduling patterns, and engineering constraints). Below is a structured reading guide; the complete SKILL.md is appended at the bottom.
01 Motivation: Three Failure Modes
Long-running code agents exhibit three recurring failure modes whose common cause is missing engineering scaffolding, not model capability. Every mechanism in the framework targets them.
01 Cognitive Loop. Successive iterations try similar directions with diminishing returns, unable to escape a local optimum on their own.
02 Stalling. The agent finishes a chunk, summarizes, and waits for feedback. Externally the session looks alive and polling runs, but work has stopped. This is more common than crashes.
03 Runtime Fragility. Context compaction silently breaks the loop; closing a session takes down the timers parasitic on it. Failures go unnoticed by default.
02 Behavioral Constraints
Hard rules of the framework, each induced from a real failure.
i. Zero interaction. No prompting the user during a run: no Plan Mode, no question tool, no ending on a question. Continue until stopped. Resolve ambiguity yourself and log the reasoning (level=decision).
ii. Ready means execute. The most common hidden violation: finishing all prep, then asking “should I submit?”. Prep exists to be executed; submitting, resubmitting, fixing, and starting monitors are routine and need no confirmation.
iii. Callback means report-alive. After context compaction the loop silently dies. The first action of every callback updates its own last_seen, then checks liveness; on failure it restarts immediately and logs it.
iv. Persist state to files. All progress is written to state/ files, not conversation memory. Each iteration starts a fresh session, injecting only curated state; never resume.
v. Guardian / worker separation. A heartbeat patrol may only do three things to tasks that aren’t its own: liveness-check, restart, nudge. It does not read their data, modify their state, or report for them. Basis: a patrol once overstepped into another task’s business, causing context pollution, reporting drift, and concurrent-write risk.
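The "callback means report-alive" rule can be sketched as a small Python decorator. This is only an illustration under stated assumptions: the per-loop `last_seen` file, its plain-text timestamp format, and the name `reports_alive` are hypothetical and are not defined by the framework. Writing to a file owned by one loop also keeps the guardian/worker separation, since no loop touches another loop's marker.

```python
import functools
import time
from pathlib import Path


def reports_alive(last_seen_file: Path):
    """Make a callback update its own last_seen marker before doing any work.

    Assumption: each loop owns one marker file holding a Unix timestamp.
    """
    def decorator(callback):
        @functools.wraps(callback)
        def wrapper(*args, **kwargs):
            # First action of every callback: report alive.
            last_seen_file.parent.mkdir(parents=True, exist_ok=True)
            last_seen_file.write_text(str(time.time()), encoding="utf-8")
            # Only then run the actual business logic.
            return callback(*args, **kwargs)
        return wrapper
    return decorator
```

Because the marker is written before the body runs, a callback that later raises still leaves a fresh timestamp, so a watchdog can tell "crashed recently" apart from "silently dead for hours".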
03 Architecture
The orchestrator monitors state, detects stalls, and injects new directions; each task runs in its own fresh session. Three core decisions: separate execution from evaluation, prefer fresh sessions over resume, and enforce direction diversity.
┌── Orchestrator (current session / durable cron) ─────────────┐
│ monitor state files → detect stalls → inject new direction   │
└──────┬──────────────────┬──────────────────┬─────────────────┘
       ▼                  ▼                  ▼
   [Task A]           [Task B]           [Task C]  ← each its own fresh session
04 State File System
Each task keeps its own state and log directories. Three process types write separate log streams, so debugging never needs cross-file correlation.
{task}/state/
├── task_spec.md           # goal / milestones / success criteria
├── progress.json          # {iteration, status, stale_count, ...}
├── findings.jsonl         # accumulated findings (append-only)
├── directions_tried.json  # directions tried (basis for diversity)
└── iteration_log.jsonl    # per-iteration summary
{task}/logs/
├── work.jsonl             # work agent; decisions tagged level=decision
├── orchestrator.jsonl     # orchestrator
└── heartbeat.jsonl        # heartbeat watchdog
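As a minimal sketch of how an adopter might lay out these files, the Python below creates the state and log directories and treats findings.jsonl as append-only. The file names come from the layout above; the function names, the initial `status` value `"running"`, and the fields inside a finding are assumptions made for illustration, since the protocol leaves implementation to the adopter.

```python
import json
from pathlib import Path


def init_task(root: Path, goal: str) -> Path:
    """Create state/ and logs/ for one task and write the initial state files."""
    state = root / "state"
    state.mkdir(parents=True, exist_ok=True)
    (root / "logs").mkdir(exist_ok=True)
    (state / "task_spec.md").write_text(f"# Goal\n{goal}\n", encoding="utf-8")
    # Keys follow the documented progress.json; the status value is an assumption.
    progress = {"iteration": 0, "status": "running", "stale_count": 0}
    (state / "progress.json").write_text(json.dumps(progress), encoding="utf-8")
    (state / "directions_tried.json").write_text("[]", encoding="utf-8")
    return state


def append_finding(state: Path, finding: dict) -> None:
    """Append one finding as one JSON line; existing lines are never rewritten."""
    with (state / "findings.jsonl").open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(finding, ensure_ascii=False) + "\n")


def load_findings(state: Path) -> list[dict]:
    """Read all findings back, e.g. to inject curated state into a fresh session."""
    path = state / "findings.jsonl"
    if not path.exists():
        return []
    lines = path.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line]
```

Opening the file in append mode keeps earlier findings intact even if a later session crashes mid-iteration, which is the point of keeping progress in files instead of conversation memory.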
05 Stall Detection & Pivoting
| Mechanism | Rule |
|---|---|
| Stall detection | An iteration with 0 new findings or a metric drop → stale_count + 1 |
| Forced pivot | stale_count ≥ 2 → change a structural constraint, not tactical parameters; ≥ 4 → flag for human attention |
| Direction diversity | A new direction must differ from every tried one; after a stall, inject perturbation (start from the opposite hypothesis, find structurally similar cross-domain cases, etc.) |
| Round cap | A single work session caps at 15 rounds or 30 minutes |
Why pivot structure, not tactics
This comes from practice: when a task stalls repeatedly within a frame, the decisive gain usually comes from correcting the environment/structural constraint itself, not from tuning strategy parameters harder inside the existing frame. Two stalls should prompt questioning the environment, not deeper digging in one direction.
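The stall counter and the two thresholds can be sketched in a few lines of Python. The thresholds (2 and 4) and the stall condition come from the rules in this section; the `Progress` class, the action names, and above all the reset of `stale_count` after a productive iteration are assumptions, because the page does not say when the counter resets.

```python
from dataclasses import dataclass


@dataclass
class Progress:
    iteration: int = 0
    stale_count: int = 0


def record_iteration(p: Progress, new_findings: int, metric_delta: float) -> str:
    """Update stale_count for one finished iteration and return the next action."""
    p.iteration += 1
    if new_findings == 0 or metric_delta < 0:
        p.stale_count += 1  # 0 new findings or a metric drop counts as stale
    else:
        p.stale_count = 0   # assumption: a productive iteration resets the counter
    if p.stale_count >= 4:
        return "flag_human"        # flag for human attention
    if p.stale_count >= 2:
        return "pivot_structure"   # change a structural constraint, not parameters
    return "continue"
```

Keeping this decision in the orchestrator, fed only by counts and metric deltas, matches the framework's choice to separate execution from evaluation: the work agent never judges its own progress.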
06 Heartbeat Watchdog (3-Layer)
The business loop is itself unreliable and needs an independent guardian layer. Three mutually-checking layers: any one dying can be detected and recovered by another.
| Layer | Form | Role |
|---|---|---|
| L0 | A resident shell guard depending on no session | Heartbeat timestamp stale > 2h → spin up an emergency patrol via a headless agent |
| L1 | A durable scheduled job, hourly | Check each loop’s last_seen, restart timed-out loops, detect stalling and nudge |
| L2 | Business loops, each in its own session | First line of each callback updates its own last_seen |
Stall detection threshold
If progress has no update for over 2h and the last output is a question → judged stalled, launch a nudge subagent (inject the task’s task_spec and progress, instruct it to continue and update state). Three consecutive nudges with no progress → judged structurally stuck; stop nudging and reopen with a new direction. The 2h threshold is deliberately shorter than the 4h stuck-task threshold: stalling is a voluntary stop, cheap to fix, worth catching earlier.
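The stall judgement described above can be written as a pure function, which makes it easy to test without a live agent. This is a sketch under assumptions: detecting "the last output is a question" by a trailing question mark is a simplification chosen for illustration, and the function and action names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

STALL_AFTER = timedelta(hours=2)  # no progress update for over 2h
MAX_NUDGES = 3                    # consecutive nudges without progress


def classify(last_progress: datetime, last_output: str,
             nudges_without_progress: int,
             now: Optional[datetime] = None) -> str:
    """Return 'ok', 'nudge', or 'reopen_new_direction' for one task."""
    now = now or datetime.now(timezone.utc)
    if nudges_without_progress >= MAX_NUDGES:
        # Structurally stuck: stop nudging, reopen with a new direction.
        return "reopen_new_direction"
    idle = now - last_progress
    # Assumption: a trailing question mark marks "ended on a question".
    ends_on_question = last_output.rstrip().endswith(("?", "？"))
    if idle > STALL_AFTER and ends_on_question:
        return "nudge"
    return "ok"
```

A patrol would call this with timestamps taken from the task's progress file and then launch the nudge subagent itself; the function only decides, it does not read or modify the task's data.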
07 Subagent Scheduling Patterns
| Pattern | Use | Key Idea |
|---|---|---|
| A Goal-driven | Research iteration | Inject tried directions, require verifiable findings, write back to findings.jsonl |
| B Parallel exploration | Complex sub-problems | Fire multiple agents in one message: investigation, refutation, cross-domain analogy |
| C Experiment run | Long compute jobs | Start minute-level polling right after submit: auto-diagnose errors, fix, resubmit |
| D Verification | Post-iteration QA | An independent subagent audits the evidence chain of findings |
A subagent prompt should include: background, a verifiable deliverable, working directory, file/line caps, and completion criteria.
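One way to make sure no required part of a subagent prompt is forgotten is to build it from a small record type. The five parts listed above are from the source; the class name `SubagentBrief`, its field names, and the rendered wording are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubagentBrief:
    background: str
    deliverable: str          # must be verifiable, e.g. a file that must exist
    workdir: str
    max_files: int
    max_lines_per_file: int
    done_when: str            # completion criteria

    def render(self) -> str:
        """Render the brief as prompt text, refusing to emit an incomplete one."""
        missing = [k for k, v in vars(self).items() if v in ("", None)]
        if missing:
            raise ValueError(f"brief is missing: {', '.join(missing)}")
        return "\n".join([
            f"Background: {self.background}",
            f"Deliverable (verifiable): {self.deliverable}",
            f"Working directory: {self.workdir}",
            f"Caps: at most {self.max_files} files, "
            f"{self.max_lines_per_file} lines per file",
            f"Done when: {self.done_when}",
            "Do not ask the user questions; log decisions instead.",
        ])
```

A call such as `SubagentBrief("stalled twice on tuning", "3 new entries in findings.jsonl", "tasks/demo", 5, 300, "findings verified").render()` yields a prompt that carries every required part, and an empty field fails fast instead of producing a vague brief.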
08 Engineering Constraints
Induced by the meta-learning loop from real failures; violating them empirically caused stalls or regressions.
1. At most 5 large files per iteration; no single file over 300 lines.
2. State is injected via files, not conversation history.
3. Validation (test / compile / check) must run between iterations.
4. Citation-like content is verified every 20 entries, never batched up.
5. With multiple candidate directions, prefer adding diversity over digging one deeper.
6. Unresolvable external-dependency failures escalate: full report + notify the owner + poll for a reply; never abandon silently.
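Constraint 4 (verify citation-like content every 20 entries instead of in one final batch) might look like the sketch below. The checkpoint size of 20 is from the source; the `verify` callback is a hypothetical stand-in for whatever external check an adopter uses, and it is assumed to return the entries it could not confirm.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical verifier: takes a batch, returns the entries it could NOT confirm.
Verifier = Callable[[List[str]], List[str]]


def ingest_citations(entries: Iterable[str], verify: Verifier,
                     every: int = 20) -> Tuple[List[str], List[str]]:
    """Verify citations at checkpoints of `every` entries; return (accepted, rejected)."""
    accepted: List[str] = []
    rejected: List[str] = []
    pending: List[str] = []

    def flush() -> None:
        if not pending:
            return
        unconfirmed = set(verify(list(pending)))
        for entry in pending:
            (rejected if entry in unconfirmed else accepted).append(entry)
        pending.clear()

    for entry in entries:
        pending.append(entry)
        if len(pending) >= every:
            flush()  # checkpoint reached: verify before adding more
    flush()          # verify the final partial checkpoint as well
    return accepted, rejected
```

Checking in small checkpoints limits how far a fabricated reference can propagate into later text before it is caught, which fits the page's point that the framework makes external checking a mechanical step without removing the error source.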
09 Validation & Limits
The framework has carried several heterogeneous long-horizon tasks. Output of the paper-writing track (pages / citations / in-framework self-rating):
| Paper | Pages | Citations | Self-rated |
|---|---|---|---|
| Autonomous Research Agents | 59 | 228 | 8.0 |
| Continual Learning | 65 | 326 | 8.0 |
| Long-Horizon Decision-Making | 55 | 384 | 8.0 |
| Self-Play (285B RL experiment + theory hardening) | 75 | 217 | 8.6 |
Limits (honest disclosure)
- Scores come from in-framework multi-persona simulated review; comparable only longitudinally within the same protocol, not an external quality claim.
- Longest continuous run was 72 hours with 6 directional human inputs — zero operational intervention, directional intervention retained.
- Fabricated citations and data artifacts originate from the LLM itself; the framework makes external checking a mechanical step in the process, it does not remove the error source.
- Separation of duties relies on protocol constraints, not model discipline; removing the constraints brings overstepping back.
10 Full SKILL.md
The authoritative source for the guide above: one self-contained Markdown document depending on no external resource.
name: Deli_AutoResearch
description: A protocol framework for long-horizon autonomous tasks. Targets three empirically-observed failure modes (cognitive loops, stalling, runtime fragility) by prescribing state management, stall detection, and watchdog mechanisms. Validated on multiple task types including paper writing (4 ICLR-format surveys, in-framework self-rating 8.0-8.6/10).
type: Agent Framework
tags: autonomous, long-horizon, zero-interaction, anti-loop, heartbeat-watchdog, loop, multi-agent, unattended, orchestration
Deli_AutoResearch
This skill is a protocol framework for long-horizon autonomous tasks (days to weeks). It ships no executable code; instead it prescribes a set of battle-tested conventions: how state is persisted, how stalls are detected, how guardians are layered, and what constraints bind agent behavior. Implementation details are left to the adopter’s environment.
1. Motivation
Long-running code agents exhibit three recurring failure modes:
- Cognitive loop — successive iterations try similar directions with diminishing returns, unable to escape a local optimum on their own.
- Stalling — the agent finishes a chunk of work, outputs a summary, and waits for user feedback. Externally the session looks alive and polling runs, but work has effectively stopped. Run logs show this is more common than crashes.
- Runtime fragility — context compaction silently breaks the loop; closing a session takes down the timers parasitic on it. Failures go unnoticed by default.
The common cause of all three is missing engineering scaffolding, not insufficient model capability. Every mechanism in this framework targets the failure modes above.
2. Behavioral Constraints
- Zero interaction — no prompting the user during a run: no Plan Mode, no question tool, no ending on a question. Continue working until the user stops you. Resolve ambiguity yourself and write the reasoning to the log (level=decision).
- Ready means execute — the most common hidden violation: finishing all preparation and then asking “should I submit?”. The purpose of preparation is execution; submitting, resubmitting, fixing, and starting monitors are all routine operations needing no confirmation.
- Callback means report-alive — after context compaction the loop dies silently. The first action of every callback is to update its own last_seen, then check liveness; on detecting failure it restarts immediately and logs it.
- Persist state to files — all progress is written to state/ files, not conversation memory. Each iteration starts a fresh session, injecting only curated state; never use resume.
- Guardian / worker separation — a heartbeat patrol may take only three actions on tasks that are not its own: liveness-check, restart, nudge. It does not read their data, modify their state files, or report to the user on their behalf.
3. Architecture
┌── Orchestrator (current session / durable cron) ──┐
│ monitor state files → detect stalls → inject direction │
└────┬─────────────┬─────────────┬────────────┘
[Task A] [Task B] [Task C] ← each its own fresh session
Core design decisions:
- Separate execution from evaluation — the agent doing the work does not judge its own progress; stall determination is made by the orchestration layer based on quantitative metrics.
- Fresh session over resume — context accumulation is the primary cause of cognitive loops. Each iteration starts with fresh context; state is injected via files.
- Enforced direction diversity — before each iteration, read the list of tried directions; a new direction must differ from all history.
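The diversity rule needs some concrete test for "differs from all history", which the protocol does not specify. The sketch below uses token-set Jaccard overlap as one possible stand-in; the function name `is_novel`, the tokenizer, and the 0.6 threshold are all assumptions, and an adopter could just as well use an LLM judge or embeddings.

```python
import re


def _tokens(text: str) -> frozenset[str]:
    """Lowercased alphanumeric word set of a direction description."""
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


def is_novel(candidate: str, tried: list[str], max_overlap: float = 0.6) -> bool:
    """Accept a direction only if it is not too similar to any tried one.

    Assumption: similarity is Jaccard overlap of word sets; the threshold is arbitrary.
    """
    cand = _tokens(candidate)
    if not cand:
        return False  # an empty direction is never acceptable
    for past in tried:
        prev = _tokens(past)
        union = cand | prev
        if union and len(cand & prev) / len(union) > max_overlap:
            return False
    return True
```

In use, the orchestrator would load the tried list from directions_tried.json before each iteration, reject candidates that fail the check, and append the accepted direction to the file.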
4. State Files
{task}/state/
├── task_spec.md # goal / milestones / success criteria
├── progress.json # {iteration, total_findings, status, stale_count}
├── findings.jsonl # accumulated findings (append-only)
├── directions_tried.json # directions already tried
└── iteration\_log.jsonl # per-iteration summary
{task}/logs/
├── work.jsonl # written by work agent; decisions tagged level=decision
├── orchestrator.jsonl # written by orchestrator
└── heartbeat.jsonl # written by heartbeat watchdog
Log line format: {"ts":"…", "source":"…", "level":"info|warn|error|decision", "event":"…", "detail":"…"}
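A small helper that emits lines in this shape could look like the following Python sketch. The five keys and the four levels are taken from the format above; the helper name `log_event` and the ISO 8601 UTC timestamp are assumptions, since the source elides the actual `ts` format.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

LEVELS = {"info", "warn", "error", "decision"}


def log_event(log_file: Path, source: str, level: str,
              event: str, detail: str = "") -> dict:
    """Append one JSON log line and return the record that was written."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),  # assumed timestamp format
        "source": source,
        "level": level,
        "event": event,
        "detail": detail,
    }
    log_file.parent.mkdir(parents=True, exist_ok=True)
    with log_file.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

For example, a work agent resolving an ambiguity on its own would call `log_event(Path("demo/logs/work.jsonl"), "work", "decision", "pick_direction", "chose B: A was already tried")`, which is how the zero-interaction rule leaves an audit trail instead of a question to the user.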
5. Usage
# 1. Initialize the task directory, write state/task_spec.md and an initial progress.json
# 2. Start the orchestrator loop:
/loop 2h check all tasks under : (1) read progress.json;
(2) if stale_count>=3 generate a fresh direction; (3) launch a work agent
via the Agent tool (with explicit goal and completion criteria);
(4) write results back to state files. Zero interaction.
# 3. Register a durable heartbeat watchdog (survives across sessions):
hourly patrol: write a timestamp; check each loop's last_seen against interval×3,
restart if exceeded; check each task's progress for stalls over 2h, nudge if stalled.
Zero interaction.
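The patrol's "last_seen against interval×3" check from step 3 reduces to a short comparison. In this sketch the function name and the dictionary inputs (loop name to Unix timestamp, loop name to interval in seconds) are assumptions; how the patrol obtains those values and how it restarts a loop depend on the adopter's environment.

```python
import time
from typing import Dict, List, Optional


def timed_out_loops(last_seen: Dict[str, float], interval_s: Dict[str, float],
                    now: Optional[float] = None) -> List[str]:
    """Return the loops whose last_seen is older than three of their own intervals."""
    now = time.time() if now is None else now
    return sorted(
        name for name, seen in last_seen.items()
        if now - seen > 3 * interval_s[name]
    )
```

A loop scheduled every 2 hours would therefore be restarted once it has been silent for more than 6 hours. The patrol only reads liveness markers and restarts; it never opens the task's state files.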
6. Stall Detection & Pivoting
| Mechanism | Rule |
|---|---|
| Stall detection | an iteration with 0 new findings or a metric drop → stale_count + 1 |
| Forced pivot | stale_count >= 2 → change a structural constraint, not tactical parameters; >= 4 → flag for human attention |
| Direction diversity | a new direction must differ from every tried one; after a stall, inject a perturbation strategy |
| Round cap | a single work session caps at 15 rounds or 30 minutes |
“Pivot structure, not tactics” comes from practice: when a task stalls repeatedly within a frame, the decisive gain usually comes from correcting the environment/structural constraint itself, not from tuning strategy parameters harder inside the existing frame.
7. Heartbeat Watchdog
The business loop is itself unreliable and needs an independent guardian layer. Three mutually-checking layers (V3):
| Layer | Form | Depends on | Role |
|---|---|---|---|
| L0 | resident shell guard | no session | heartbeat stale > 2h → spin up an emergency patrol via a headless agent |
| L1 | durable cron, hourly | a living interactive session | check each loop’s last_seen, restart timed-out loops, detect stalling and nudge |
| L2 | business loop | each its own session | first line of each callback updates its own last_seen |
Any one layer dying can be detected and recovered by another.
Stall detection: if progress has no update for over 2 hours and the last output is a question → judged stalled, launch a nudge subagent. Three consecutive nudges with no progress → judged structurally stuck; stop nudging and reopen with a new direction. The 2h threshold is deliberately shorter than the 4h stuck-task threshold.
8. Subagent Scheduling Patterns
| Pattern | Use | Key idea |
|---|---|---|
| A Goal-driven | research iteration | inject tried directions, require verifiable findings, write back to findings.jsonl |
| B Parallel exploration | complex sub-problems | fire |