GenClaw: Code-Driven Agentic Image Generation
Summary
GenClaw introduces a code-driven agentic image generation framework that breaks the black-box paradigm by mimicking the human creative process: conceptualizing, sketching with code (SVG/HTML/Three.js), and then using generative models for texture and photorealism.
Cached at: 05/29/26, 07:01 AM
Source: https://huggingface.co/papers/2605.30248
Abstract
GenClaw presents a code-driven agentic image generation framework that enables precise visual construction through conceptualization, sketching, and coloring stages, integrating programmatic logic with generative models.
Image generation models have evolved from text-conditioned pixel synthesis toward multimodal agents endowed with visual comprehension and tool invocation capabilities. Yet existing agents remain at the mercy of underlying black-box image models. Their workflow is trapped in a repetitive cycle of prompt rewriting for generation refinement, leaving them with no mechanism to directly manipulate the canvas. In essence, the potential of LLMs to serve as a genuine “brush” for precise visual construction remains largely untapped. In this paper, we propose GenClaw, a code-driven agentic image generation paradigm that empowers the agent to create like a human artist: first conceptualizing, then sketching, and finally coloring. Specifically, the agent first constructs the conceptual knowledge and context through search and reasoning. It then utilizes code (e.g., SVG, HTML, Three.js) to render executable visual sketches. Finally, it employs an image generation model to supplement textures, materials, and photorealism. In this workflow, code serves as a controllable intermediate canvas bridging linguistic reasoning and pixel synthesis, seamlessly integrating programmatic logic with the visual expressiveness of generative models. By transforming image generation from a black-box paradigm into a staged process akin to authentic human creation, GenClaw offers a step toward highly controllable and interpretable visual generation systems.
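To make the three stages concrete, here is a minimal Python sketch of the conceptualize, sketch, and color flow. It is an illustration under stated assumptions, not GenClaw's implementation: the names `Shape`, `conceptualize`, `sketch_svg`, and `color` are hypothetical, the layout is hard-coded where a real agent would search and reason, and the final stage only assembles a request instead of calling an image generation model.

```python
from dataclasses import dataclass
from xml.sax.saxutils import quoteattr


@dataclass
class Shape:
    """One element of the layout plan (hypothetical structure)."""
    kind: str    # "rect" or "circle"
    x: float     # left edge for a rect, centre x for a circle
    y: float     # top edge for a rect, centre y for a circle
    size: float  # side length for a rect, radius for a circle
    fill: str    # flat placeholder colour
    label: str   # semantic name of the element


def conceptualize(prompt: str) -> list[Shape]:
    # Stage 1 stand-in: a real agent would build this plan from the
    # prompt via search and reasoning. Hard-coded here for illustration.
    return [
        Shape("rect", 0, 160, 256, "#5a8f3c", "ground"),
        Shape("rect", 88, 100, 80, "#b5651d", "house"),
        Shape("circle", 208, 48, 24, "#f5c518", "sun"),
    ]


def sketch_svg(shapes: list[Shape], width: int = 256, height: int = 256) -> str:
    # Stage 2: turn the plan into an SVG sketch. Every position and
    # size is explicit in the markup, so it can be edited directly.
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">'
    ]
    for s in shapes:
        label = quoteattr(s.label)  # returns the value already quoted
        fill = quoteattr(s.fill)
        if s.kind == "rect":
            parts.append(
                f'<rect x="{s.x}" y="{s.y}" width="{s.size}" '
                f'height="{s.size}" fill={fill} data-label={label}/>'
            )
        elif s.kind == "circle":
            parts.append(
                f'<circle cx="{s.x}" cy="{s.y}" r="{s.size}" '
                f'fill={fill} data-label={label}/>'
            )
        else:
            raise ValueError(f"unsupported shape kind: {s.kind}")
    parts.append("</svg>")
    return "\n".join(parts)


def color(svg: str, prompt: str) -> dict:
    # Stage 3 placeholder: bundle what a sketch-conditioned image model
    # might receive. No model is called; the request format is assumed.
    return {"prompt": prompt, "control_sketch": svg}


prompt = "a small house under the sun"
sketch = sketch_svg(conceptualize(prompt))
request = color(sketch, prompt)
print(sketch)
```

Because the sketch is ordinary markup, moving the sun or resizing the house means changing a number in the plan and re-rendering, which is the kind of direct canvas manipulation that prompt rewriting alone does not offer.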
Similar Articles
PixelClaw: an LLM agent for image manipulation
PixelClaw is a free, open-source LLM agent that combines conversational AI with image generation, editing, and audio tools in a Raylib-based drag-and-drop UI.
GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation
GenEvolve is a self-evolving image generation framework that uses tool-orchestrated trajectories and visual experience distillation to iteratively improve generative capabilities, achieving state-of-the-art performance.
Alien Dreams: An Emerging Art Scene
The article highlights the emerging scene of AI-generated art using OpenAI's CLIP model as a steering mechanism for generative models, showcasing various examples of text-to-image outputs.
SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning
SpatialClaw is a training-free framework that uses code as an action interface to enable flexible, stateful spatial reasoning in vision-language models, achieving superior performance across diverse 3D/4D spatial reasoning tasks.
VisualClaw: A Real-Time, Personalized Agent for the Physical World
VisualClaw is a self-evolving multimodal agent that reduces deployment costs through hybrid encoding and skill evolution, while improving video-QA accuracy across multiple benchmarks.