FS-Researcher: Test-Time Scaling for Long-Horizon Research Tasks with File-System-Based Agents

arXiv cs.CL Papers

Summary

FS-Researcher introduces a file-system-based dual-agent framework that lets LLM agents conduct deep research beyond context window limits by using persistent external memory as a shared workspace. The framework achieves state-of-the-art results on research benchmarks, and final report quality correlates positively with the computation allocated to evidence collection, demonstrating effective test-time scaling.

arXiv:2602.01566v2

# FS-Researcher: Test-Time Scaling for Long-Horizon Research Tasks with File-System-Based Agents

Source: https://arxiv.org/html/2602.01566

Chiwei Zhu¹,², Benfeng Xu¹,²†, Mingxuan Du¹, Shaohan Wang¹
Xiaorui Wang², Zhendong Mao¹, Yongdong Zhang¹

¹University of Science and Technology of China
²Metastone Technology

{tanz, benfeng}@mail.ustc.edu.cn

###### Abstract

Deep research is emerging as a representative long-horizon task for large language model (LLM) agents. However, long trajectories in deep research often exceed model context limits, compressing token budgets for both evidence collection and report writing, and preventing effective test-time scaling. We introduce FS-Researcher, a file-system-based, dual-agent framework that scales deep research beyond the context window via a persistent workspace. Specifically, a Context Builder agent acts as a librarian which browses the internet, writes structured notes, and archives raw sources into a hierarchical knowledge base that can grow far beyond context length. A Report Writer agent then composes the final report section by section, treating the knowledge base as the source of facts. In this framework, the file system serves as a durable external memory and a shared coordination medium across agents and sessions, enabling iterative refinement beyond the context window. Experiments on two open-ended benchmarks (DeepResearch Bench and DeepConsult) show that FS-Researcher achieves state-of-the-art report quality across different backbone models. Further analyses demonstrate a positive correlation between final report quality and the computation allocated to the Context Builder, validating effective test-time scaling under the file-system paradigm. The code and data are anonymously open-sourced at https://github.com/Ignoramus0817/FS-Researcher.

⁴Work done during the internship in Metastone Technology.
†Corresponding author. Project Lead.

## 1 Introduction

Deep Research has recently emerged as a frontier and representative task for autonomous large language model (LLM) agents, demanding PhD-level expertise (OpenAI, 2025; Google, 2025). Given an open-ended research query, deep research requires an agent to systematically collect evidence from the internet and synthesize it into a comprehensive report, which often involves navigating through hundreds of webpages and producing long reports containing more than 10K tokens. The complexity of deep research poses a major challenge to agent design: models' context lengths are inherently limited, while long-horizon research tasks can easily exceed these capacities, halting further agent execution. This limitation prevents agents from allocating sufficient computation to tasks—the token budgets available for both information gathering and report writing are severely compressed, often falling short of what the task actually demands. As a result, static pipelines or single-agent workflows often suffer from incomplete coverage of relevant sources and produce lower-quality reports (Shao et al., 2024; Roucher et al., 2025; Tao et al., 2025; Li et al., 2025a; Zheng et al., 2025).

To address this challenge, recent works typically reduce token consumption by offloading web browsing to sub-agents or summarizing tool observations, retaining only distilled key facts in the main agent context (Tavily, 2025; LangChain, 2025; Li et al., 2025b; Lei et al., 2025; Prabhakar et al., 2025). While these methods extend the working trajectories of agents, they are temporary fixes that remain constrained by the hard limit of model context length. Moreover, in these approaches, internal states such as thoughts and tool observations are ephemeral consumables that are discarded once the agent loop terminates, hindering further scaling through iterative refinement.

![Different deep research paradigms: (1) Top: Static pipelines and naive single agents that put raw observations in the context; (2) Middle: Agents whose trajectories are extended by compressing the observations, while still bounded by the hard context limit; (3) Bottom: FS-Researcher, an agent framework built on top of an external file system workspace with unlimited context size.](Figure 1)

Recent progress in coding agents and AI-powered IDEs suggests that a file-system workspace is an effective substrate for long-horizon tool use and iterative development (Yang et al., 2024; Cursor, 2025; Anthropic, 2025b). Inspired by this paradigm yet addressing the unique nature of deep research—where agents must navigate noisy web content, extract and organize factual evidence, and synthesize coherent narratives—we propose FS-Researcher, a dual-agent framework that separates evidence accumulation from report composition. The first agent, Context Builder, functions as a librarian that browses the internet, reads potentially relevant documents, takes notes, and archives them into a hierarchically organized knowledge base whose size can far exceed context limits. The second agent, Report Writer, composes the report section by section, treating the knowledge base as its sole source of facts and loading relevant information on demand.

Figure 2 shows the overview of FS-Researcher. Our file-system-based workspace offers three key advantages: (1) it mirrors the native environment humans use for complex, long-horizon tasks, providing well-established interfaces for deep research; (2) it can store information far exceeding the model's context window, allowing on-demand access without context overflow; and (3) it makes intermediate artifacts (e.g., plans, error logs) persistent and revisitable, enabling iterative refinement across multiple agent sessions. Notably, file I/O introduces negligible latency (<0.03% of total wall-clock time; see Appendix M). A comparison of file-system-based agents and existing paradigms is shown in Figure 1.

As a result, our framework natively enables iterative refinement through the persistent workspace, thus allowing better token utilization for both information gathering and report writing. Extensive experiments demonstrate that FS-Researcher achieves state-of-the-art performance on open-ended deep research benchmarks across various backbone models. Further ablation studies reveal a positive correlation between the quality of the final report and the computation allocated to the Context Builder agent, indicating effective test-time scaling with the file-system paradigm.

Collectively, the contributions of this work are as follows:

- We propose FS-Researcher, a dual-agent, file-system-based framework for solving long-horizon research tasks.
- We validate the effectiveness of our framework through extensive experiments.
- We demonstrate a positive correlation between deep research performance and the computation invested in context building.

![The framework of FS-Researcher.](Figure 2)

## 2 FS-Researcher

FS-Researcher is a dual-agent framework that solves research tasks using a file-system-based workspace. The two agents also represent two stages: given a research topic, the Context Builder agent builds a comprehensive knowledge base, and then the Report Writer agent composes the report section by section. The agents share the same workspace and can refine the deliverables independently and iteratively.

### 2.1 Architecture

Before diving into the two agents, we introduce the common architecture that drives the whole framework, including three parts: tools, workflow, and workspace.

#### Tools

FS-Researcher uses two types of tools: file system tools and web browsing tools, listed in Table 1. We use Google SERP API and Jina AI API for search_web and read_webpage respectively.

#### Workflow

FS-Researcher adopts a standard ReAct architecture for each agent, which can be formulated as follows:

$$T_i, A_i = M_\theta\left(T_{j<i}, A_{j<i}, O_{j<i}, P\right) \tag{1}$$

$$O_i = \mathrm{Execute}(A_i) \tag{2}$$

Here, $T_i$, $A_i$, and $O_i$ are the thought, action, and observation at the $i$-th step, respectively; $M_\theta$ is the model with parameters $\theta$; $P$ is the prompt (system prompt and user query); and $\mathrm{Execute}(A_i)$ is the tool implementation that executes the action $A_i$ and returns the observation $O_i$.
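The loop defined by Equations (1) and (2) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Step`, `react_loop`, the `model` callable, the `tools` mapping, and the `finish` action are hypothetical names, and the model is a scripted stub rather than an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Action = Tuple[str, str]  # (tool name, argument); a hypothetical encoding of A_i


@dataclass
class Step:
    thought: str      # T_i
    action: Action    # A_i
    observation: str  # O_i


def react_loop(
    model: Callable[[List[Step], str], Tuple[str, Action]],
    tools: Dict[str, Callable[[str], str]],
    prompt: str,
    max_steps: int = 10,
) -> List[Step]:
    """Run thought/action/observation steps until the model emits 'finish'."""
    history: List[Step] = []
    for _ in range(max_steps):
        # Eq. (1): the model conditions on all earlier steps and the prompt P.
        thought, action = model(history, prompt)
        name, arg = action
        if name == "finish":  # assumed stop signal, not from the paper
            break
        # Eq. (2): execute the chosen tool to obtain the observation.
        observation = tools[name](arg)
        history.append(Step(thought, action, observation))
    return history


def scripted_model(history: List[Step], prompt: str) -> Tuple[str, Action]:
    # Stand-in for M_theta: list the workspace once, then stop.
    if not history:
        return "Inspect the workspace first.", ("ls", ".")
    return "Nothing left to do.", ("finish", "")


tools = {"ls": lambda path: "index.md knowledge_base/ sources/"}
trajectory = react_loop(scripted_model, tools, prompt="research topic")
```

Because the history grows with every step, a loop like this eventually hits the context limit, which is the constraint the file-system workspace is meant to relieve.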

#### Workspace

The workspace of FS-Researcher contains two types of files: deliverables and control files. All the files in the workspace are stored in Markdown format. Deliverables are the final output files, which vary in form and convention depending on the task type. Detailed deliverables of each agent will be introduced in Section 2.2 and Section 2.3.

Control files help the agents track the progress, and contain:

- **Todos**: A list of tasks to be completed, each with a status of [PENDING], [IN-PROGRESS], or [COMPLETE].
- **Checklist**: The acceptance criteria for a task, including file format rules, quality checks, etc.
- **Logs**: A log of the execution trajectory.

The framework natively supports multi-session workflow. At the beginning of each session, the agent inspects the current workspace, formulates a plan, and commences execution. During execution, the agent dynamically updates the todo file by modifying item statuses and adding, removing, or reordering tasks as needed. Upon session completion, the agent evaluates the workspace against the checklist, re-marking any non-compliant items as [IN-PROGRESS], and determines whether the overall task is complete. All inspection results, review findings, and session plans are recorded in the log file, which remains accessible to subsequent sessions and human collaborators, thereby facilitating iterative refinement.
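The end-of-session review described above can be sketched as a small routine over the todo file. The paper specifies only the three status labels; the `- [STATUS] task` line format and the names `parse_todos`, `reopen_failed`, and `task_complete` are assumptions made for illustration.

```python
import re
from typing import List, Set, Tuple

# Assumed todo line format: "- [STATUS] task description"
TODO_LINE = re.compile(r"^- \[(PENDING|IN-PROGRESS|COMPLETE)\] (.+)$")

Todo = Tuple[str, str]  # (status, task)


def parse_todos(text: str) -> List[Todo]:
    """Extract (status, task) pairs from a Markdown todo file."""
    todos = []
    for line in text.splitlines():
        match = TODO_LINE.match(line.strip())
        if match:
            todos.append((match.group(1), match.group(2)))
    return todos


def reopen_failed(todos: List[Todo], failed: Set[str]) -> List[Todo]:
    """Re-mark tasks that failed the checklist review as IN-PROGRESS."""
    return [("IN-PROGRESS" if task in failed else status, task)
            for status, task in todos]


def task_complete(todos: List[Todo]) -> bool:
    """The overall task is done only when every item is COMPLETE."""
    return bool(todos) and all(status == "COMPLETE" for status, _ in todos)


todo_md = """\
- [COMPLETE] Collect market size estimates
- [COMPLETE] Summarize regulatory landscape
"""
todos = parse_todos(todo_md)
reviewed = reopen_failed(todos, failed={"Summarize regulatory landscape"})
```

Here `todos` looks finished before the review, while `reviewed` carries one reopened item for the next session to pick up.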

In our design, we let the agent generate todos autonomously and manually curate a static checklist. The checklists and a log example are shown in Appendix B, demonstrating how control files help record status and identify issues.

| Type | Tool Name | Description |
|------|-----------|-------------|
| File System | ls | List the files and sub-directories in the target directory. |
| | grep | A simplified version of the UNIX grep command that searches with a regular expression. |
| | read_file | Read a file. Pagination is supported with page size and page index as arguments. |
| | insert/delete/replace | Modify specified lines in a file. By default, insert writes after the specified line. |
| Web Browsing | search_web | Search a query and return relevant URLs and summaries. |
| | read_webpage | Read a URL. Pagination is supported with page size and page index as arguments. |

**Table 1: Tools used in FS-Researcher.**
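As an illustration of the file system tools in Table 1, the sketch below implements a paginated `read_file` and a line-based `insert`. The tool names come from the table, but the signatures are assumptions: the page size is taken to be a number of lines, the page index is 0-based, and line numbers are 1-based.

```python
import tempfile
from pathlib import Path
from typing import List


def read_file(path: Path, page_size: int = 50, page_index: int = 0) -> str:
    """Return one page of a file; a page is `page_size` lines (assumed unit)."""
    lines = path.read_text(encoding="utf-8").splitlines()
    start = page_index * page_size
    return "\n".join(lines[start:start + page_size])


def insert(path: Path, line_no: int, new_lines: List[str]) -> None:
    """Insert `new_lines` after the 1-based line `line_no`."""
    lines = path.read_text(encoding="utf-8").splitlines()
    # "After 1-based line n" is the same as "at 0-based index n".
    lines[line_no:line_no] = new_lines
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    note = Path(tmp) / "note.md"
    note.write_text("line 1\nline 2\nline 3\n", encoding="utf-8")
    insert(note, 1, ["inserted"])
    first_page = read_file(note, page_size=2, page_index=0)
    second_page = read_file(note, page_size=2, page_index=1)
```

Pagination lets an agent read a file larger than its remaining context budget one page at a time.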

### 2.2 Context Builder

Given a research topic, the Context Builder works as a digital librarian that meticulously collects, distills, and archives information into a knowledge base (KB).

![Knowledge base example.](Figure 3)

The deliverables of this agent include one file (index.md) and two directories (knowledge_base/ and sources/). The index.md file serves as the "Table of Contents" of the KB and contains two parts: (1) the deconstruction of the research topic, and (2) the hierarchical structure of the KB. From index.md, the agent or human collaborators can get an overview of what the KB is built for and how it is organized, and navigate to specific information sources more efficiently.

Todos are created and updated along with the modification of index.md, which guides the browsing process. The knowledge_base/ directory contains the notes written by the Context Builder when browsing the internet, organized in a tree-like structure. The names of folders and files are descriptive, reflecting the semantic relationships between the deconstructed topics. The sources/ directory contains the raw webpages archived from the internet. For traceability, each statement in the notes of knowledge_base/ comes with a citation that points to a file in the sources/ directory.
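The link between notes and archived sources can be checked mechanically. The sketch below scans the notes for citations and reports any that point to a missing file. The paper does not specify a citation syntax, so the `[source: sources/<file>]` marker and the `dangling_citations` helper are hypothetical.

```python
import re
import tempfile
from pathlib import Path
from typing import List, Tuple

# Assumed citation marker: "[source: sources/<file>]"
CITATION = re.compile(r"\[source: (sources/[^\]]+)\]")


def dangling_citations(workspace: Path) -> List[Tuple[str, str]]:
    """Return (note, cited path) pairs whose cited source file does not exist."""
    dangling = []
    for note in sorted((workspace / "knowledge_base").rglob("*.md")):
        for cited in CITATION.findall(note.read_text(encoding="utf-8")):
            if not (workspace / cited).is_file():
                dangling.append((note.relative_to(workspace).as_posix(), cited))
    return dangling


with tempfile.TemporaryDirectory() as tmp:
    ws = Path(tmp)
    (ws / "knowledge_base" / "market").mkdir(parents=True)
    (ws / "sources").mkdir()
    (ws / "sources" / "report_a.md").write_text("raw page", encoding="utf-8")
    (ws / "knowledge_base" / "market" / "size.md").write_text(
        "Claim one. [source: sources/report_a.md]\n"
        "Claim two. [source: sources/missing.md]\n",
        encoding="utf-8",
    )
    result = dangling_citations(ws)
```

A check of this kind could feed the end-of-session review, since a dangling citation is a concrete, machine-detectable gap in the knowledge base.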

During the context building stage, the Context Builder browses the internet with the search_web and read_webpage tools, updates index.md, distills key information into notes in the knowledge_base/ directory, and archives raw webpages in the sources/ directory. Note that this workflow is not linear: the agent does not first deconstruct the topic, then design a target KB structure, and finally fill the folders with files. Instead, index.md and the knowledge_base/ directory are dynamically updated as the agent browses the internet and gradually forms its understanding of the topic.

At the end of each session, the Context Builder conducts a review against the checklist, identifying any potential errors, gaps, or conflicts in the knowledge base. If any issues are found, corresponding items are marked as [IN-PROGRESS] and recorded in the log file. The Context Builder can iteratively refine the knowledge base until it reaches the session budget limit (i.e., the maximum number of sessions that the Context Builder is allowed to run) or does not identify any issue in the review.
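The stopping rule in this paragraph (stop when the session budget is exhausted or when a review finds no issues) can be written as a short driver loop. This is a sketch with stubbed sessions and reviews; `build_context`, `run_session`, and `review` are hypothetical names, and the issue strings are made up for the example.

```python
from typing import Callable, List, Tuple


def build_context(
    run_session: Callable[[int], None],
    review: Callable[[], List[str]],
    max_sessions: int,
) -> Tuple[int, List[str]]:
    """Run sessions until a review is clean or the session budget is used up."""
    issues: List[str] = []
    used = 0
    for session in range(1, max_sessions + 1):
        run_session(session)   # browse, take notes, archive sources
        used = session
        issues = review()      # end-of-session check against the checklist
        if not issues:
            break              # knowledge base passes the review
    return used, issues


# Stubbed example: the first two reviews find issues, the third is clean.
review_results = [["gap: no pricing data"], ["conflict: two market sizes"], []]
log: List[str] = []
used, remaining = build_context(
    run_session=lambda s: log.append(f"session {s}"),
    review=lambda: review_results[len(log) - 1],
    max_sessions=5,
)
```

Raising `max_sessions` is the knob that allocates more computation to the Context Builder.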

Figure 3 shows an example of the knowledge base; the full structure is given in Appendix A.

### 2.3 Report Writer

Once the Context Builder marks the knowledge base as complete, the Report Writer takes over the workspace and starts to compose the report. In this stage, we remove the web browsing tools and let Report Writer treat the knowledge base built by Context Builder as the only source of facts. The deliverable of this stage is report.md.

A critical observation is that if the whole report is written in one-shot generation, it tends to read like a mere list of facts, lacking explanation and in-depth analysis. Therefore, we adopt a multi-session writing process, where the Report Writer creates an outline file in the first writing session, and chooses exactly one section to compose in subsequent sessions. The outline also serves as the todo file for the Report Writer, where each section carries a status of [PENDING], [IN-PROGRESS], or [COMPLETE].

Upon the completion of a section, the Report Writer performs a section-level review against the section-level checklist. The status of the current section is changed to [COMPLETE] only when the self-check passes. After all sections are completed, an overall review is conducted using the report-level checklist. If flaws are identified, the corresponding sections are marked as [IN-PROGRESS] again. The Report Writer continuously executes until the entire report is finished and passes all reviews. There is no budget limit in this stage.
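The two-level review can be sketched as follows, with one section written per session. The function and callback names are hypothetical, the outline-creation session is not modeled, and the writing and checking steps are stubs.

```python
from collections import Counter
from typing import Callable, List


def write_report(
    outline: List[str],
    write_section: Callable[[str], None],
    section_ok: Callable[[str], bool],
    report_flaws: Callable[[], List[str]],
) -> int:
    """Write one section per session until the whole report passes review."""
    status = {section: "PENDING" for section in outline}
    sessions = 0
    while True:  # no session budget in this stage, as in the text above
        todo = [s for s in outline if status[s] != "COMPLETE"]
        if todo:
            section = todo[0]
            status[section] = "IN-PROGRESS"
            write_section(section)
            sessions += 1
            if section_ok(section):      # section-level checklist
                status[section] = "COMPLETE"
            continue
        flawed = report_flaws()          # report-level checklist
        if not flawed:
            return sessions
        for section in flawed:
            status[section] = "IN-PROGRESS"


attempts: Counter = Counter()
reviews = [["Introduction"], []]  # first overall review flags one section

sessions = write_report(
    outline=["Introduction", "Analysis"],
    write_section=lambda s: attempts.update([s]),
    # Stub: "Analysis" fails its first section-level check.
    section_ok=lambda s: not (s == "Analysis" and attempts[s] < 2),
    report_flaws=lambda: reviews.pop(0),
)
```

In this run each section is written twice: "Analysis" fails its first section-level check, and "Introduction" is reopened by the report-level review.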

| Category | Method | RACE: Comp. | RACE: Instr. | FACT: Eff. c. | FACT: C. Acc. |
|----------|--------|-------------|--------------|---------------|---------------|

**Table 2: Performance on DeepResearch Bench.**

Comp., Instr., Eff. c., and C. Acc. denote comprehensiveness, instruction following, effective citations, and citation accuracy, respectively. The best performance is highlighted in bold and the second best is underlined.
