Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language
Summary
Chat2Workflow introduces a benchmark and agentic framework for generating executable visual workflows from natural language, showing that current LLMs can often capture high-level intent but still struggle to produce workflows that meet the bar for industrial-grade automation.
Cached at: 04/22/26, 06:17 AM
Source: https://huggingface.co/papers/2604.19667
Abstract
Chat2Workflow presents a benchmark and agentic framework for automating executable visual workflow generation from natural language, revealing significant challenges in achieving industrial-grade automation despite advances in language models.
At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve, making development costly, time-consuming, and error-prone. To study whether large language models can automate this multi-round interaction process, we introduce Chat2Workflow, a benchmark for generating executable visual workflows directly from natural language, and propose a robust agentic framework to mitigate recurrent execution errors. Chat2Workflow is built from a large collection of real-world business workflows, with each instance designed so that the generated workflow can be transformed and directly deployed to practical workflow platforms such as Dify and Coze. Experimental results show that while state-of-the-art language models can often capture high-level intent, they struggle to generate correct, stable, and executable workflows, especially under complex or changing requirements. Although our agentic framework yields up to 5.34% resolve rate gains, the remaining real-world gap positions Chat2Workflow as a foundation for advancing industrial-grade automation. Code is available at https://github.com/zjunlp/Chat2Workflow.
Get this paper in your agent:
hf papers read 2604.19667
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
Abstracting Cross-Domain Action Sequences into Interpretable Workflows
This paper introduces WorkflowView, a framework that uses large language models to abstract low-level, noisy user action sequences into interpretable high-level activities, demonstrating effectiveness across browser logs, MOOC dropout prediction, and privacy-preserving document workflow analysis.
Workspace agents
This article introduces OpenAI's 'Workspace Agents' in ChatGPT, designed to handle repeatable, structured workflows rather than one-off tasks. It explains the core concepts, anatomy, and best practices for using and building these agents for consistent business processes.
Agent workflow visualizer: feedback and corrections
A tool for visualizing AI agent workflows is introduced, supporting multiple agent frameworks including Langgraph, CrewAI, AutoGen, Google ADK, and OpenAI Agents SDK. The creator seeks community feedback and corrections.
Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields
Workflow-GYM is a benchmark for evaluating AI agents on long-horizon GUI tasks in professional domains. Experiments show that even top models achieve only ~30% success, revealing significant challenges.
Tried 12+ agentic AI workflow builders this year — these 5 actually work in production
A review of five agentic AI workflow builders that actually work in production, highlighting SimplAI as a standout enterprise agent operating system and arguing that the workflow layer matters more than model quality.