EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis

arXiv cs.CL Papers

Summary

EnvScaler is an automated framework for scaling tool-interactive environments for LLM agents through programmatic synthesis, creating 191 diverse environments and 7K scenarios to improve agent performance on multi-turn, multi-tool interactions.

arXiv:2601.05808v2. Code and data: https://github.com/RUC-NLPIR/EnvScaler

Cached at: 04/20/26, 08:31 AM

# EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis

Source: https://arxiv.org/html/2601.05808

Xiaoshuai Song, Haofei Chang, Guanting Dong, Yutao Zhu, Ji-Rong Wen, Zhicheng Dou

Gaoling School of Artificial Intelligence, Renmin University of China. {songxiaoshuai,dou}@ruc.edu.cn

GitHub: https://github.com/RUC-NLPIR/EnvScaler

## Abstract

Large language models (LLMs) are expected to be trained to act as agents in various real-world environments, but this process relies on rich and varied tool-interaction sandboxes. However, access to real systems is often restricted; LLM-simulated environments are prone to hallucinations and inconsistencies; and manually built sandboxes are hard to scale. In this paper, we propose EnvScaler, an automated framework for scalable tool-interaction environments via programmatic synthesis. EnvScaler comprises two components. First, SkelBuilder constructs diverse environment skeletons through topic mining, logic modeling, and quality evaluation. Then, ScenGenerator generates multiple task scenarios and rule-based trajectory validation functions for each environment. With EnvScaler, we synthesize 191 environments and about 7K scenarios, and apply them to Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) for Qwen3 series models. Results on three benchmarks show that EnvScaler significantly improves LLMs' ability to solve tasks in complex environments involving multi-turn, multi-tool interactions.

## 1 Introduction

Large language models (LLMs) are increasingly expected to serve as agents in a wide range of real-world applications, such as modifying orders in e-commerce backends, rescheduling flights via ticketing platforms, or managing documents in a file system (Luo et al., 2025; Yao et al., 2025; Qian et al., 2025). In these applications, the agent operates within a specific environment (Env), interacting with the user to gather information and invoking tools to query or update the Env's state. This challenges LLMs to combine dialogue and tool use, adapt actions based on Env feedback, and solve tasks while respecting Env rules over long-horizon trajectories.

To develop such capable LLM agents, scaling up rich and diverse tool-interactive environments is essential. Whether by collecting trajectories followed by imitation learning, or by autonomous exploration and reinforcement learning (RL) within Envs, we hope that exposure to a sufficiently broad range of environments during training will enable LLMs to generalize effectively to unseen environments and scenarios at test time (Huang et al., 2025; Liu et al., 2025a; Froger et al., 2026).

However, as compared in Table 1, real-world environments often have restricted access; LLM-simulated environments also suffer from hallucinations and inconsistencies. Recently, a series of studies (Patil et al., 2025; Yao et al., 2025; Lu et al., 2025) build stateful, tool-interactive sandboxes through executable programs, offering advantages in controllability and stability. Nonetheless, these environments are manually crafted for evaluation purposes, with limited coverage and scalability. Therefore, a key challenge lies in automating the synthesis and scaling of sandbox environments to support training. It requires creating diverse, high-quality environments with states, tools, and interaction logic, and designing tasks that align with each environment.

| Env Type | Scalable | Consistent | Controllable | Stable | Explainable |
|----------|----------|-----------|-------------|--------|-----------|
| Real-World | ✗ | ✓ | ✗ | ✓ | ✓ |
| LLM-Simulated | ✓ | ✗ | ✓ | ✗ | ✗ |
| Programmatic | ✓ | ✓ | ✓ | ✓ | ✓ |

**Table 1:** Key property comparison of three Env types for LLM training. Scalable: ease of large-scale expansion; Consistent: logical coherence between multiple calls; Controllable: flexibility in modifying Env logic; Stable: reproducible over time; Explainable: transparency of Env logic. Symbols denote: ✓ full support, ✗ not supported, ✓ partial or conditional support.

Several studies have made progress in tackling this challenge, with LLMs used as programmers of environment logic rather than direct simulators. One approach (Ye et al., 2025; Sullivan et al., 2025) focuses solely on tool-layer modeling. It does not model the sandbox's state, nor consider the interaction logic between tools and the database. Another approach (Tang et al., 2024; Piriyakullij et al., 2025) seeks to programmatically reconstruct environments from existing observations (e.g., trajectories), but inevitably depends on access to pre-existing environments. Besides, AgentScaler (Fang et al., 2025) and AutoForge (Cai et al., 2025) rely on pre-collected toolsets or tool documentation, and lack an automated mechanism for assessing environment quality. Due to these limitations, a notable gap remains in automatically synthesizing and scaling tool-interactive environments without relying on environmental priors or toolsets.

To bridge this gap, we propose **EnvScaler**, an automated, scalable framework for synthesizing diverse, executable, tool-interactive environments to train LLM agents. We first introduce **SkelBuilder** to automate the construction of environment skeletons, covering topic mining, logic modeling, and assessment. It comprises three modules: (1) **Task-driven environment discovery**: mines diverse environment themes from existing open-source task sets. (2) **Executable environment construction**: starting from an environment description, it plans states and tools, and programmatically implements them into a complete, runnable environment. (3) **Quality inspection**: a testing agent sends tool requests, while a checking agent assesses whether executions meet expectations. This process iterates over multiple rounds, with the pass rate indicating environment quality.
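
The quality-inspection loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `ToyCounterEnv`, `inspect_env`, `propose_call`, and `judge` are hypothetical names, and the testing and checking agents, which would be LLM-backed in practice, are replaced by simple rule-based stand-ins.

```python
from typing import Any, Callable, Dict, Tuple

ToolCall = Tuple[str, Dict[str, Any]]


class ToyCounterEnv:
    """Hypothetical stand-in environment with one deliberately buggy tool."""

    def __init__(self) -> None:
        self.value = 0

    def snapshot(self) -> int:
        return self.value

    def call(self, name: str, args: Dict[str, Any]) -> Dict[str, Any]:
        if name == "increment":
            self.value += args["by"]
            return {"ok": True, "value": self.value}
        if name == "reset":
            # Bug on purpose: reports success but never resets the state.
            return {"ok": True, "value": self.value}
        return {"ok": False, "error": f"unknown tool {name}"}


def inspect_env(env, propose_call: Callable[[int], ToolCall],
                judge: Callable[..., bool], rounds: int) -> float:
    """Run `rounds` test calls and return the fraction judged as expected."""
    passed = 0
    for step in range(rounds):
        call = propose_call(step)          # testing agent: picks a tool request
        before = env.snapshot()
        result = env.call(*call)
        after = env.snapshot()
        passed += bool(judge(call, before, result, after))  # checking agent
    return passed / rounds if rounds else 0.0


SCRIPT = [("increment", {"by": 2}), ("reset", {}), ("increment", {"by": 1}), ("reset", {})]


def propose_call(step: int) -> ToolCall:
    return SCRIPT[step % len(SCRIPT)]


def judge(call: ToolCall, before: int, result: Dict[str, Any], after: int) -> bool:
    name, args = call
    if name == "increment":
        return after == before + args["by"]
    return after == 0  # a correct reset must zero the counter


pass_rate = inspect_env(ToyCounterEnv(), propose_call, judge, rounds=4)
```

In this toy run the two `increment` calls behave as expected and the two `reset` calls do not, so the pass rate flags the environment as only partly correct.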

To further synthesize multiple task scenarios for each environment, we propose **ScenGenerator**. To ensure task relevance and solvability within a given environment and scenario, ScenGenerator first synthesizes the environment's initial database/state, and then derives challenging tasks from the current state. To achieve rule-based trajectory verification, ScenGenerator generates a set of terminal-state validation functions for each task. After the trajectory ends, these functions check whether the final environment state meets the expected conditions, using the functions' pass rate as the reward score.
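
As a rough illustration of the reward described above, the sketch below scores a trajectory by the fraction of terminal-state checks that pass. The function name `trajectory_reward`, the state layout, and the example task are assumptions made for this example; the paper's generated validation functions may look different.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Validator = Callable[[State], bool]


def trajectory_reward(final_state: State, validators: List[Validator]) -> float:
    """Return the fraction of terminal-state checks that pass (0.0 if none)."""
    if not validators:
        return 0.0
    passed = 0
    for check in validators:
        try:
            passed += bool(check(final_state))
        except Exception:
            # A check that crashes on the final state counts as a failure.
            pass
    return passed / len(validators)


# Hypothetical task: "cancel order 17 and refund it to the original card".
validators: List[Validator] = [
    lambda s: s["orders"][17]["status"] == "cancelled",
    lambda s: s["orders"][17]["refunded"] is True,
    lambda s: s["orders"][17]["refund_method"] == "card",
]

final_state: State = {
    "orders": {17: {"status": "cancelled", "refunded": True, "refund_method": "gift_card"}}
}
reward = trajectory_reward(final_state, validators)
```

Here two of the three checks pass, so the agent receives partial credit rather than an all-or-nothing signal.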

To validate the effectiveness of EnvScaler, we synthesized 191 environments and about 7K scenarios, applying them to SFT and RL for the Qwen3 series models. Evaluation on multiple tool-use benchmarks (Patil et al., 2025; Yao et al., 2025; Chen et al., 2025) shows that EnvScaler significantly enhances LLMs' ability to solve tasks in complex environments involving multi-turn, multi-tool interactions. Further analysis of environment coverage, scale, and training strategies provides insights into how synthetic environments promote tool learning and generalization for LLM agents.

In summary, we propose EnvScaler for scalable tool-interactive environment synthesis. Our contributions are threefold: (1) We propose SkelBuilder, an automated framework for synthesizing diverse, executable environment skeletons. (2) We propose ScenGenerator, a scenario generation pipeline that produces state data, challenging tasks, and rule-based trajectory verification for each environment. (3) Experiments on three benchmarks verify the effectiveness of EnvScaler in improving LLMs' ability to solve tasks in complex environments involving multi-turn, multi-tool interactions.

## 2 Related Work

### 2.1 Tool Use of LLMs

Many studies aim to improve LLMs' ability to solve tasks with tools (Qu et al., 2025; Luo et al., 2025). In this paper, we focus on general tool use across various domain-specific environments (Patil et al., 2025; Yao et al., 2025; Chen et al., 2025), rather than tool-integrated reasoning and web information access centered on Python or search tools (Dong et al., 2025; Li et al., 2025a). Several works have explored training data and RL strategies from different perspectives (Prabhakar et al., 2025; Liu et al., 2025b; Xu et al., 2025; Zhang et al., 2026; Zhao et al., 2025). However, they mainly synthesize static trajectories and cannot support LLMs' self-exploration. For trajectory evaluation, they primarily rely on surface matching, which checks whether generated tool names and parameters match references. Such matching can neither determine whether the task was truly completed nor accommodate multiple equivalent solution paths. In contrast, we synthesize executable environments and tasks, along with rule-based evaluation grounded in the environment's state, thereby supporting LLM training across varied scenarios.
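
To make the contrast between surface matching and state-grounded evaluation concrete, here is a small sketch on a toy inventory. The tools `set_quantity` and `add_quantity`, the helper names, and the expected condition are all hypothetical and exist only for this example.

```python
from copy import deepcopy
from typing import Any, Dict, List, Tuple

Call = Tuple[str, Dict[str, Any]]


def run_calls(state: Dict[str, int], calls: List[Call]) -> Dict[str, int]:
    """Apply (tool_name, kwargs) calls to a copy of a toy inventory state."""
    state = deepcopy(state)
    for name, kwargs in calls:
        if name == "set_quantity":
            state[kwargs["item"]] = kwargs["quantity"]
        elif name == "add_quantity":
            state[kwargs["item"]] = state.get(kwargs["item"], 0) + kwargs["delta"]
        else:
            raise ValueError(f"unknown tool: {name}")
    return state


def surface_match(predicted: List[Call], reference: List[Call]) -> bool:
    """Surface matching: tool names and parameters must equal the reference."""
    return predicted == reference


def final_state_ok(state: Dict[str, int]) -> bool:
    """State-based check: the task only requires 5 widgets in stock."""
    return state.get("widget") == 5


initial = {"widget": 3}
reference = [("set_quantity", {"item": "widget", "quantity": 5})]
predicted = [("add_quantity", {"item": "widget", "delta": 2})]

surface_ok = surface_match(predicted, reference)
state_ok = final_state_ok(run_calls(initial, predicted))
```

The predicted trajectory uses a different tool than the reference, so surface matching rejects it, while the final-state check accepts it because the inventory ends in the required state.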

### 2.2 Scaling Environments for LLM Agent

Environments provide agents with action feedback and rewards for interaction and policy optimization. We focus on tool-interactive environments, where LLM agents can use tools to query environmental information or change the state of the environment. One line of work (Guo et al., 2024, 2025; Castellani et al., 2025; Li et al., 2025b) leverages LLMs' reasoning and world knowledge to simulate environments. Although this approach avoids building real environments, it is prone to hallucinations and inconsistencies, and lacks transparency and persistent state management. Another line of work (Tang et al., 2024; Ye et al., 2025; Fang et al., 2025; Cai et al., 2025) builds sandbox environments through programming. However, these methods either model only isolated, stateless functions, or rely on environmental priors (e.g., trajectories, toolsets) and lack automatic assessment, which limits scalability and coverage. Therefore, we propose EnvScaler to enable automatic, scalable environment and scenario synthesis for agent training.

## 3 Automated Env Skeleton Synthesis Overview

The goal of SkelBuilder is to construct environments {E}, where each can be abstracted as a set of three elements:

E = {F_exec, E_doc, Σ_tool}

- **Executable program files F_exec**: Complete logic implementation of E's states, tools, and rules.
- **Documentation E_doc**: Provides the agent with introductions or rules about E.
- **Tool interface set Σ_tool**: Names, parameters, and descriptions of all tools exposed to the agent, serving as the entry for agent–Env interaction.
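
One way to picture the three elements of E is as a small container type. The sketch below is only an assumed layout: the class and field names (`EnvSkeleton`, `ToolSpec`, `exec_files`, and so on) are illustrative and do not come from the released code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToolSpec:
    """One entry of the tool interface set exposed to the agent."""
    name: str
    description: str
    parameters: Dict[str, str]  # parameter name -> short type or description


@dataclass
class EnvSkeleton:
    exec_files: Dict[str, str]   # F_exec: file name -> program source
    doc: str                     # E_doc: introduction and rules shown to the agent
    tools: List[ToolSpec] = field(default_factory=list)  # tool interface set

    def tool_names(self) -> List[str]:
        return [tool.name for tool in self.tools]


# Hypothetical example environment.
env = EnvSkeleton(
    exec_files={"env.py": "class OrderBackend: ..."},
    doc="An order backend. Shipped orders cannot be cancelled.",
    tools=[
        ToolSpec("get_order", "Look up an order by id.", {"order_id": "int"}),
        ToolSpec("cancel_order", "Cancel an unshipped order.", {"order_id": "int"}),
    ],
)
```

Keeping the tool interface separate from the program files mirrors the split in the definition: the agent only sees `doc` and `tools`, while `exec_files` holds the logic that enforces the rules.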

As shown in Figure 3, SkelBuilder enables an automated workflow from text resource mining to environment modeling and evaluation.

### 3.1 Task-Guided Env Discovery

The first step in scaling environments is to collect diverse environment themes. Unlike manual presetting or derivation from API collections (Fang et al., 2025), SkelBuilder mines them from existing text resources. Studies around SFT have gathered a large and diverse set of tasks, many of which implicitly contain latent environmental contexts. We therefore derive themes by reverse inference from these existing tasks.

Given a task set T_exist = {t_1, ..., t_n}, an LLM M first performs binary filtering to retain tasks situated within a domain-specific, stateful environment. For each retained task, M infers the corresponding environment description:

{E'_des} = {M(P^env_infer || t) | t ∈ T_exist, M(P^task_filter || t)},   (1)

where P^task_filter and P^env_infer denote prompts for task filtering and environment inference. The inferred environments are then aggregated and deduplicated by embedding each description and retaining one record from groups of highly similar descriptions, yielding the final diverse, non-redundant set {E_des} = Dedup({E'_des}, sim).
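
The filter, infer, and deduplicate steps of Eq. (1) can be sketched as below. Everything here is a simplified stand-in: the two LLM calls are replaced by keyword rules, the embedding is a toy bag-of-words vector rather than a learned encoder, the greedy keep-first strategy is one possible reading of "retaining one record from groups of highly similar descriptions", and the 0.8 threshold is an arbitrary choice.

```python
import math
from collections import Counter
from typing import Callable, List


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a real pipeline would use a sentence encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def dedup(descriptions: List[str], threshold: float) -> List[str]:
    """Greedily keep a description only if it is not too similar to any kept one."""
    kept: List[str] = []
    kept_vecs: List[Counter] = []
    for des in descriptions:
        vec = embed(des)
        if all(cosine(vec, other) < threshold for other in kept_vecs):
            kept.append(des)
            kept_vecs.append(vec)
    return kept


def discover_envs(tasks: List[str], is_stateful_task: Callable[[str], bool],
                  infer_env: Callable[[str], str], threshold: float = 0.8) -> List[str]:
    """Filter tasks, infer one environment description per task, then dedup."""
    inferred = [infer_env(t) for t in tasks if is_stateful_task(t)]
    return dedup(inferred, threshold)


# Keyword rules standing in for the two prompted LLM calls.
def is_stateful_task(task: str) -> bool:
    return "order" in task or "flight" in task


def infer_env(task: str) -> str:
    if "order" in task:
        return "an e-commerce order management backend"
    return "an airline ticket booking system"


tasks = [
    "Cancel my order #12",
    "Change the address on order #40",
    "Move my flight to Friday",
    "What is 2 + 2?",
]
env_descriptions = discover_envs(tasks, is_stateful_task, infer_env)
```

The arithmetic question is filtered out, and the two order-related tasks collapse into a single environment description after deduplication.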

### 3.2 Automated Executable Env Construction

To transform the environment description into a programmatically modeled environment, we design a three-stage pipeline.

**Logic Planning.** An LLM enriches the environment description E_des, inferring the Env state definition E_state, domain rules E_rule, and the list of tool operations {E_tool_i}. These elements serve as a structured blueprint, with E_rule concatenated with E_des to form the environment documentation E_doc:

E_state, E_rule = M(P^state_plan || E_des),   (2)

{E_tool_i} = M(P^tool_plan || E_des || E_state || E_rule).
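
The two planning calls in Eq. (2) chain naturally: the second prompt consumes the output of the first. The sketch below shows that data flow under several assumptions: the prompt strings are placeholders, the model is assumed to answer in JSON, and `fake_llm` returns canned responses in place of a real model.

```python
import json
from typing import Callable, Dict, List

# Hypothetical prompt texts; the paper's actual prompts are not reproduced here.
P_STATE_PLAN = "Plan the state schema and domain rules as JSON."
P_TOOL_PLAN = "Plan the tool operations as a JSON list."


def plan_logic(llm: Callable[[str], str], env_des: str) -> Dict[str, object]:
    # Step 1: E_state, E_rule = M(P_state_plan || E_des)
    planned = json.loads(llm("\n".join([P_STATE_PLAN, env_des])))
    env_state, env_rule = planned["state"], planned["rules"]

    # Step 2: {E_tool_i} = M(P_tool_plan || E_des || E_state || E_rule)
    tool_prompt = "\n".join(
        [P_TOOL_PLAN, env_des, json.dumps(env_state), json.dumps(env_rule)]
    )
    env_tools: List[dict] = json.loads(llm(tool_prompt))

    # E_doc is formed by concatenating E_rule with E_des.
    env_doc = "\n".join([env_des, *env_rule])
    return {"state": env_state, "rules": env_rule, "tools": env_tools, "doc": env_doc}


def fake_llm(prompt: str) -> str:
    """Canned responses standing in for a real model call."""
    if prompt.startswith(P_STATE_PLAN):
        return json.dumps({
            "state": {"orders": "dict[int, Order]"},
            "rules": ["Shipped orders cannot be cancelled."],
        })
    return json.dumps([{"name": "cancel_order", "params": {"order_id": "int"}}])


blueprint = plan_logic(fake_llm, "An e-commerce order management backend.")
```

Planning tools only after the state and rules are fixed means each tool can be described in terms of the state it reads or writes, which is the dependency the second line of Eq. (2) expresses.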
