EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis
Summary
EnvScaler is an automated framework for scaling tool-interactive environments for LLM agents through programmatic synthesis, creating 191 diverse environments and 7K scenarios to improve agent performance on multi-turn, multi-tool interactions.
Cached at: 04/20/26, 08:31 AM
# EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis
Source: https://arxiv.org/html/2601.05808
Xiaoshuai Song, Haofei Chang, Guanting Dong, Yutao Zhu, Ji-Rong Wen, Zhicheng Dou
Gaoling School of Artificial Intelligence, Renmin University of China. {songxiaoshuai,dou}@ruc.edu.cn
GitHub: https://github.com/RUC-NLPIR/EnvScaler
## Abstract
Large language models (LLMs) are expected to be trained to act as agents in various real-world environments, but this process relies on rich and varied tool-interaction sandboxes. However, access to real systems is often restricted; LLM-simulated environments are prone to hallucinations and inconsistencies; and manually built sandboxes are hard to scale. In this paper, we propose EnvScaler, an automated framework for scalable tool-interaction environments via programmatic synthesis. EnvScaler comprises two components. First, SkelBuilder constructs diverse environment skeletons through topic mining, logic modeling, and quality evaluation. Then, ScenGenerator generates multiple task scenarios and rule-based trajectory validation functions for each environment. With EnvScaler, we synthesize 191 environments and about 7K scenarios, and apply them to Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) for Qwen3 series models. Results on three benchmarks show that EnvScaler significantly improves LLMs' ability to solve tasks in complex environments involving multi-turn, multi-tool interactions.
## 1 Introduction
Large language models (LLMs) are increasingly expected to serve as agents in a wide range of real-world applications, such as modifying orders in e-commerce backends, rescheduling flights via ticketing platforms, or managing documents in a file system (Luo et al., 2025; Yao et al., 2025; Qian et al., 2025). In these applications, the agent operates within a specific environment (Env), interacting with the user to gather information and invoking tools to query or update the Env's state. This challenges LLMs to combine dialogue and tool use, adapt actions based on Env feedback, and solve tasks while respecting Env rules over long-horizon trajectories.
To develop such capable LLM agents, scaling up rich and diverse tool-interactive environments is essential. Whether by collecting trajectories followed by imitation learning, or by autonomous exploration and reinforcement learning (RL) within Envs, we hope that exposure to a sufficiently broad range of environments during training will enable LLMs to generalize effectively to unseen environments and scenarios at test time (Huang et al., 2025; Liu et al., 2025a; Froger et al., 2026).
However, as Table 1 shows, access to real-world environments is often restricted, while LLM-simulated environments suffer from hallucinations and inconsistencies. Recently, a series of studies (Patil et al., 2025; Yao et al., 2025; Lu et al., 2025) have built stateful, tool-interactive sandboxes as executable programs, which offer advantages in controllability and stability. Nonetheless, these environments are manually crafted for evaluation purposes, with limited coverage and scalability. A key challenge therefore lies in automating the synthesis and scaling of sandbox environments to support training. This requires creating diverse, high-quality environments with states, tools, and interaction logic, and designing tasks that align with each environment.
| Env Type | Scalable | Consistent | Controllable | Stable | Explainable |
|----------|----------|-----------|-------------|--------|-----------|
| Real-World | ✗ | ✓ | ✗ | ✓ | ✓ |
| LLM-Simulated | ✓ | ✗ | ✓ | ✗ | ✗ |
| Programmatic | ✓ | ✓ | ✓ | ✓ | ✓ |
**Table 1:** Key property comparison of three Env types for LLM training. Scalable: ease of large-scale expansion; Consistent: logical coherence between multiple calls; Controllable: flexibility in modifying Env logic; Stable: reproducible over time; Explainable: transparency of Env logic. Symbols denote: ✓ full support, ✗ not supported, ✓ partial or conditional support.
Several studies have made progress in tackling this challenge, with LLMs used as programmers of environment logic rather than direct simulators. One approach (Ye et al., 2025; Sullivan et al., 2025) focuses solely on tool-layer modeling. It does not model the sandbox's state, nor consider the interaction logic between tools and the database. Another approach (Tang et al., 2024; Piriyakullij et al., 2025) seeks to programmatically reconstruct environments from existing observations (e.g., trajectories), but inevitably depends on access to pre-existing environments. Besides, AgentScaler (Fang et al., 2025) and AutoForge (Cai et al., 2025) rely on pre-collected toolsets or tool documentation, and lack an automated mechanism for assessing environment quality. Due to these limitations, a notable gap remains in automatically synthesizing and scaling tool-interactive environments without relying on environmental priors or toolsets.
To bridge this gap, we propose **EnvScaler**, an automated, scalable framework for synthesizing diverse, executable, tool-interactive environments to train LLM agents. We first introduce **SkelBuilder** to automate the construction of environment skeletons, covering topic mining, logic modeling, and assessment. It comprises three modules: (1) **Task-driven environment discovery**: mines diverse environment themes from existing open-source task sets. (2) **Executable environment construction**: starting from an environment description, it plans states and tools, and programmatically implements them into a complete, runnable environment. (3) **Quality inspection**: a testing agent sends tool requests, while a checking agent assesses whether executions meet expectations. This process iterates over multiple rounds, with the pass rate indicating environment quality.
To further synthesize multiple task scenarios for each environment, we propose **ScenGenerator**. To ensure that tasks are relevant and solvable within a given environment and scenario, ScenGenerator first synthesizes the environment's initial database/state, and then derives challenging tasks from that state. To enable rule-based trajectory verification, ScenGenerator generates a set of terminal-state validation functions for each task. After the trajectory ends, these functions check whether the final environment state meets the expected conditions, and their pass rate serves as the reward score.
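As a rough sketch of this reward scheme, the snippet below scores a final state by the fraction of terminal-state checks it satisfies. The state layout, the example task, and the `reward` helper are illustrative assumptions; treating a check that raises an exception as failed is also a choice made here, not something the paper specifies.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Validator = Callable[[State], bool]

def reward(final_state: State, validators: List[Validator]) -> float:
    """Fraction of terminal-state checks satisfied by the final state."""
    if not validators:
        return 0.0
    passed = 0
    for check in validators:
        try:
            passed += bool(check(final_state))
        except Exception:
            # Assumption: a check that crashes on a malformed state counts as failed.
            pass
    return passed / len(validators)

# Hypothetical task: "cancel order 17 and refund it; leave order 18 alone".
validators: List[Validator] = [
    lambda s: s["orders"][17]["status"] == "cancelled",
    lambda s: s["orders"][17]["refunded"] is True,
    lambda s: s["orders"][18]["status"] == "shipped",
]

# Final state after an agent cancelled the order but forgot the refund.
final_state: State = {
    "orders": {
        17: {"status": "cancelled", "refunded": False},
        18: {"status": "shipped", "refunded": False},
    }
}

score = reward(final_state, validators)  # two of the three checks pass
```

Because the checks inspect only the final state, any tool-call sequence that reaches the expected state earns the same score, which is what allows multiple equivalent solution paths.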
To validate the effectiveness of EnvScaler, we synthesized 191 environments and about 7K scenarios, applying them to SFT and RL for the Qwen3 series models. Evaluation on multiple tool-use benchmarks (Patil et al., 2025; Yao et al., 2025; Chen et al., 2025) shows that EnvScaler significantly enhances LLMs' ability to solve tasks in complex environments involving multi-turn, multi-tool interactions. Further analysis of environment coverage, scale, and training strategies provides insights into how synthetic environments promote tool learning and generalization for LLM agents.
In summary, we propose EnvScaler for scalable tool-interactive environment synthesis. Our contributions are threefold: (1) We propose SkelBuilder, an automated framework for synthesizing diverse, executable environment skeletons. (2) We propose ScenGenerator, a scenario generation pipeline that produces state data, challenging tasks, and rule-based trajectory verification for each environment. (3) Experiments on three benchmarks verify the effectiveness of EnvScaler in improving LLMs' ability to solve tasks in complex environments involving multi-turn, multi-tool interactions.
## 2 Related Work
### 2.1 Tool Use of LLMs
Many studies aim to improve LLMs' ability to solve tasks with tools (Qu et al., 2025; Luo et al., 2025). In this paper, we focus on general tool use across various domain-specific environments (Patil et al., 2025; Yao et al., 2025; Chen et al., 2025), rather than tool-integrated reasoning and web information access centered on Python or search tools (Dong et al., 2025; Li et al., 2025a). Some studies have explored training data and RL strategies from different perspectives (Prabhakar et al., 2025; Liu et al., 2025b; Xu et al., 2025; Zhang et al., 2026; Zhao et al., 2025). However, they mainly focus on synthetic static trajectories and cannot support LLMs' self-exploration. For trajectory evaluation, they primarily rely on surface matching, checking whether generated tool names and parameters match references, which can neither determine whether the task is truly completed nor accommodate multiple equivalent solution paths. In contrast, we synthesize executable environments and tasks, along with rule-based evaluation grounded in the environment's state, thereby supporting LLM training across varied scenarios.
### 2.2 Scaling Environments for LLM Agent
Environments provide agents with action feedback and rewards for interaction and policy optimization. We focus on tool-interactive environments, where LLM agents can use tools to query environmental information or change the state of the environment. One line of work (Guo et al., 2024, 2025; Castellani et al., 2025; Li et al., 2025b) leverages LLMs' reasoning and world knowledge to simulate environments. Although this removes the need to build real environments, such simulation is prone to hallucinations and inconsistencies, and lacks transparency and persistent state management. Another line of work (Tang et al., 2024; Ye et al., 2025; Fang et al., 2025; Cai et al., 2025) builds sandbox environments through programming. However, these methods either model only isolated, stateless functions, or rely on environmental priors (e.g., trajectories, toolsets) and lack automatic assessment, which limits scalability and coverage. Therefore, we propose EnvScaler to enable automatic, scalable environment and scenario synthesis for agent training.
## 3 Automated Env Skeleton Synthesis Overview
The goal of SkelBuilder is to construct environments {E}, where each can be abstracted as a set of three elements:
E = {F_exec, E_doc, Σ_tool}
- **Executable program files F_exec**: Complete logic implementation of E's states, tools, and rules.
- **Documentation E_doc**: Provides the agent with introductions or rules about E.
- **Tool interface set Σ_tool**: Names, parameters, and descriptions of all tools exposed to the agent, serving as the entry for agent–Env interaction.
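One way to picture the three elements is as a small container type. The sketch below is an assumption-laden illustration: the class names `ToolSpec` and `EnvSkeleton`, the field names, the `load` helper, and the library example are all hypothetical and are not taken from the EnvScaler codebase.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class ToolSpec:
    """One entry of the tool interface set (name, parameters, description)."""
    name: str
    description: str
    parameters: Dict[str, str]  # parameter name -> short type/description

@dataclass
class EnvSkeleton:
    exec_files: Dict[str, str]   # F_exec: file name -> program source
    doc: str                     # E_doc: introduction and rules for the agent
    tools: List[ToolSpec] = field(default_factory=list)  # the tool interface set

    def tool_names(self) -> List[str]:
        return [t.name for t in self.tools]

    def load(self, file_name: str, class_name: str) -> Any:
        """Execute one program file and instantiate the environment class.
        Illustrative only: real use would sandbox generated code."""
        namespace: Dict[str, Any] = {}
        exec(self.exec_files[file_name], namespace)
        return namespace[class_name]()

# Hypothetical environment: a tiny library system.
LIBRARY_SRC = '''
class Library:
    def __init__(self):
        self.books = {"b1": "available"}

    def borrow_book(self, book_id):
        if self.books.get(book_id) != "available":
            return {"success": False, "error": "book not available"}
        self.books[book_id] = "borrowed"
        return {"success": True}
'''

skeleton = EnvSkeleton(
    exec_files={"library.py": LIBRARY_SRC},
    doc="A library system. A book can be borrowed only while it is available.",
    tools=[ToolSpec("borrow_book", "Borrow a book by id.", {"book_id": "str"})],
)

env = skeleton.load("library.py", "Library")
first = env.borrow_book("b1")    # succeeds and updates the state
second = env.borrow_book("b1")   # fails: the state now says "borrowed"
```

The second call failing shows the point of a stateful skeleton: tool results depend on what earlier calls did to the environment.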
As shown in Figure 3, SkelBuilder enables an automated workflow from text resource mining to environment modeling and evaluation.
### 3.1 Task-Guided Env Discovery
The first step in scaling environments is to collect diverse environment themes. Unlike manual presetting or derivation from API collections (Fang et al., 2025), SkelBuilder mines them from existing text resources. Because SFT research has already gathered large and diverse task sets that implicitly carry latent environmental contexts, we derive themes from these existing tasks through reverse inference.
Given a task set T_exist = {t_1, ..., t_n}, an LLM M first performs binary filtering to retain tasks situated within a domain-specific, stateful environment. For each retained task, M infers the corresponding environment description:
{E'_des} = {M(P^env_infer || t) | t ∈ T_exist, M(P^task_filter || t)}, (1)
where P^task_filter and P^env_infer denote prompts for task filtering and environment inference. The inferred environments are then aggregated and deduplicated by embedding each description and retaining one record from groups of highly similar descriptions, yielding the final diverse, non-redundant set {E_des} = Dedup({E'_des}, sim).
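The filter, infer, and deduplicate steps of Eq. (1) can be sketched as below. Everything concrete here is an assumption for illustration: `keep` and `infer` stand in for the two prompted LLM calls, `embed` is a toy bag-of-words embedding, the greedy pass and the 0.9 cosine threshold are one plausible reading of "retaining one record from groups of highly similar descriptions", and the function names are hypothetical.

```python
import math
from typing import Callable, List, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def discover_envs(
    tasks: List[str],
    keep: Callable[[str], bool],            # stand-in for M(P^task_filter || t)
    infer: Callable[[str], str],            # stand-in for M(P^env_infer || t)
    embed: Callable[[str], List[float]],
    threshold: float = 0.9,
) -> List[str]:
    # Filter tasks, then infer one environment description per retained task.
    candidates = [infer(t) for t in tasks if keep(t)]
    # Greedy dedup: keep a description only if it is not too close to a kept one.
    kept: List[str] = []
    kept_vecs: List[List[float]] = []
    for desc in candidates:
        vec = embed(desc)
        if all(cosine(vec, kv) < threshold for kv in kept_vecs):
            kept.append(desc)
            kept_vecs.append(vec)
    return kept

# Toy stand-ins for the LLM and the embedding model.
VOCAB = ["order", "commerce", "airline", "ticketing"]

def keep(task: str) -> bool:
    return any(k in task.lower() for k in ("order", "flight"))

def infer(task: str) -> str:
    if "order" in task.lower():
        return "e-commerce order management system"
    return "airline ticketing platform"

def embed(desc: str) -> List[float]:
    return [float(desc.count(w)) for w in VOCAB]

tasks = [
    "Cancel my order #12",
    "Change the shipping address of order 7",
    "What is 2 + 2?",
    "Rebook my flight to Monday",
]
envs = discover_envs(tasks, keep, infer, embed)
```

Here the arithmetic question is filtered out as not situated in a stateful environment, and the two order-related tasks collapse into a single environment description.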
### 3.2 Automated Executable Env Construction
To transform the environment description into a programmatically modeled environment, we design a three-stage pipeline.
**Logic Planning.** An LLM enriches the environment description E_des, inferring the Env state definition E_state, domain rules E_rule, and the list of tool operations {E_tool_i}. These elements serve as a structured blueprint, with E_rule concatenated with E_des to form the environment documentation E_doc:
E_state, E_rule = M(P^state_plan || E_des), (2)
{E_tool_i} = M(P^tool_plan || E_des || E_state || E_rule).