This paper introduces EnvSimBench, a benchmark for evaluating Large Language Models' ability to simulate environments for agent training. It identifies a 'state change cliff' in current LLMs and proposes a constraint-driven pipeline to reduce hallucinations and costs.
EnvScaler is an automated framework for scaling tool-interactive environments for LLM agents through programmatic synthesis, creating 191 diverse environments and 7K scenarios to improve agent performance on multi-turn, multi-tool interactions.
CoEvolve proposes an agent-data mutual evolution framework for training LLM agents through closed-loop, interaction-driven learning that adapts both the agent and its training data distribution. The method extracts feedback signals from rollout trajectories to guide LLM-based task synthesis, demonstrating significant improvements (15-19% absolute gains) across multiple Qwen models on AppWorld and BFCL benchmarks.
MindDR is a multi-agent deep research framework using a three-agent architecture (Planning, DeepSearch, Report) and a four-stage training pipeline, achieving competitive performance with ~30B-parameter models on multiple benchmarks. Developed by Li Auto and deployed as an online product, it also introduces MindDR Bench, a 500-query Chinese benchmark for evaluating deep research capabilities.
OpenAI Universe is a software platform that enables RL agents to interact with any existing computer program through a VNC-based interface, supporting parallel environment execution at 60 FPS with human demonstrations and automated reward extraction.
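The parallel-stepping interface pattern described above can be sketched with mocks. Everything here (class and method names, the placeholder observations and rewards) is hypothetical and stands in for the real `universe` package; the sketch only illustrates the pattern of n remote environments stepped in lockstep, each returning an (observation, reward, done) tuple:

```python
import random

class MockRemoteEnv:
    """Stands in for one VNC-backed environment instance (hypothetical)."""
    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return {"screen": [0] * 4}  # placeholder pixel observation

    def step(self, action):
        self.t += 1
        obs = {"screen": [self.rng.randint(0, 255) for _ in range(4)]}
        reward = self.rng.random()  # stands in for an extracted reward signal
        done = self.t >= 5          # toy episode: ends after 5 steps
        return obs, reward, done

class MockVectorEnv:
    """Steps several remote envs together (mocked sequentially here)."""
    def __init__(self, n):
        self.envs = [MockRemoteEnv(i) for i in range(n)]

    def reset(self):
        return [e.reset() for e in self.envs]

    def step(self, actions):
        results = [e.step(a) for e, a in zip(self.envs, actions)]
        obs, rewards, dones = zip(*results)
        return list(obs), list(rewards), list(dones)

venv = MockVectorEnv(n=3)
obs = venv.reset()
for _ in range(5):
    actions = [None] * 3  # no-op actions, just to drive the loop
    obs, rewards, dones = venv.step(actions)
```

In the real platform the per-environment stepping runs against remote VNC sessions rather than in-process objects, but the agent-facing loop has this same batched shape.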