GenClaw: Code-Driven Agentic Image Generation

Abstract

Image generation models have evolved from text-conditioned pixel synthesis toward multimodal agents with visual comprehension and tool-invocation capabilities. Yet existing agents remain constrained by the underlying black-box image models. Their workflow is confined to repeatedly rewriting prompts to refine the output, and they have no mechanism for manipulating the canvas directly. In essence, the potential of LLMs to serve as a genuine "brush" for precise visual construction remains largely untapped. In this paper, we propose GenClaw, a code-driven agentic image generation paradigm that lets the agent create like a human artist: first conceptualizing, then sketching, and finally coloring. Specifically, the agent first builds conceptual knowledge and context through search and reasoning. It then uses code (e.g., SVG, HTML, Three.js) to render executable visual sketches. Finally, it employs an image generation model to add textures, materials, and photorealism. In this workflow, code serves as a controllable intermediate canvas that bridges linguistic reasoning and pixel synthesis, combining programmatic logic with the visual expressiveness of generative models. By transforming image generation from a black-box paradigm into a staged process akin to human creation, GenClaw offers a step toward highly controllable and interpretable visual generation systems.
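
The three stages described above (conceptualize, sketch in code, then refine with an image model) can be illustrated with a minimal Python sketch. This is not the GenClaw implementation: the names `Shape`, `conceptualize`, `sketch_svg`, `generate`, and `fake_refine` are hypothetical, the planning stage is a fixed stub rather than real search and reasoning, and the image generation model is replaced by a placeholder callable. It only shows how an SVG document could act as the intermediate canvas between a plan and a final rendering step.

```python
from dataclasses import dataclass
from typing import Callable, List
from xml.etree import ElementTree as ET


@dataclass
class Shape:
    """One element of the layout plan (hypothetical structure)."""
    kind: str   # "rect" or "circle"
    x: int      # left edge for rect, center x for circle
    y: int      # top edge for rect, center y for circle
    size: int   # side length for rect, radius for circle
    fill: str


def conceptualize(prompt: str) -> List[Shape]:
    """Stage 1 stand-in: a real agent would search and reason here.
    This stub ignores the prompt and returns a fixed layout plan."""
    return [
        Shape("rect", 0, 120, 200, "#4a7c3a"),    # ground
        Shape("circle", 150, 40, 20, "#f5c542"),  # sun
    ]


def sketch_svg(shapes: List[Shape], width: int = 200, height: int = 200) -> str:
    """Stage 2: emit the visual sketch as SVG markup."""
    root = ET.Element("svg", xmlns="http://www.w3.org/2000/svg",
                      width=str(width), height=str(height))
    for s in shapes:
        if s.kind == "rect":
            ET.SubElement(root, "rect", x=str(s.x), y=str(s.y),
                          width=str(s.size), height=str(s.size), fill=s.fill)
        elif s.kind == "circle":
            ET.SubElement(root, "circle", cx=str(s.x), cy=str(s.y),
                          r=str(s.size), fill=s.fill)
    return ET.tostring(root, encoding="unicode")


def generate(prompt: str, refine: Callable[[str, str], bytes]) -> bytes:
    """Run the three stages; `refine` stands in for the image model."""
    plan = conceptualize(prompt)
    svg = sketch_svg(plan)
    return refine(svg, prompt)


def fake_refine(svg: str, prompt: str) -> bytes:
    """Placeholder for stage 3: returns the sketch bytes unchanged."""
    return svg.encode("utf-8")


if __name__ == "__main__":
    print(generate("a sunny meadow", fake_refine).decode("utf-8"))
```

Because the sketch is plain markup, each element's position, size, and color can be inspected or edited before the refinement stage runs, which is the kind of control the abstract attributes to the code canvas.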