GenClaw: Code-Driven Agentic Image Generation
May 28, 2026
Authors: Junyan Ye, Jun He, Zilong Huang, Dongzhi Jiang, Xuan Yang, Rui Chen, Weijia Li
cs.AI
Abstract
Image generation models have evolved from text-conditioned pixel synthesis toward multimodal agents endowed with visual comprehension and tool invocation capabilities. Yet, existing agents remain at the mercy of underlying black-box image models. Their workflow is trapped in a repetitive cycle of prompt rewriting for generation refinement, leaving them with no mechanism to directly manipulate the canvas. In essence, the potential of LLMs to serve as a genuine "brush" for precise visual construction remains largely untapped. In this paper, we propose GenClaw, a code-driven agentic image generation paradigm that empowers the agent to create like a human artist: first conceptualizing, then sketching, and finally coloring. Specifically, the agent first constructs the conceptual knowledge and context through search and reasoning. It then utilizes code (e.g., SVG, HTML, Three.js) to render executable visual sketches. Finally, it employs an image generation model to supplement textures, materials, and photorealism. In this workflow, code serves as a controllable intermediate canvas bridging linguistic reasoning and pixel synthesis, seamlessly integrating programmatic logic with the visual expressiveness of generative models. By transforming image generation from a black-box paradigm into a staged process akin to authentic human creation, GenClaw offers a step toward highly controllable and interpretable visual generation systems.
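The three stages described in the abstract (conceptualize, sketch in code, then color) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Shape`, `conceptualize`, `sketch_svg`, and `refinement_request` are hypothetical, the first stage is a fixed stub rather than real search and reasoning, and no image generation model is called. The point is only to show how code can act as an exact, inspectable intermediate canvas.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class Shape:
    """One labelled rectangle in the sketch (hypothetical layout primitive)."""
    label: str
    x: int
    y: int
    width: int
    height: int
    fill: str = "#cccccc"


def conceptualize(prompt: str) -> list[Shape]:
    """Stage 1 stand-in: a real agent would search and reason about the prompt.

    This stub ignores the prompt and returns a fixed layout so the
    example stays runnable.
    """
    return [
        Shape("sky", 0, 0, 200, 120, "#87ceeb"),
        Shape("ground", 0, 120, 200, 80, "#228b22"),
        Shape("house", 70, 80, 60, 60, "#b5651d"),
    ]


def sketch_svg(shapes: list[Shape], width: int = 200, height: int = 200) -> str:
    """Stage 2: emit an SVG document whose geometry is exact by construction."""
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">'
    ]
    for s in shapes:
        parts.append(
            f'  <rect x="{s.x}" y="{s.y}" width="{s.width}" '
            f'height="{s.height}" fill="{s.fill}">'
            f"<title>{escape(s.label)}</title></rect>"
        )
    parts.append("</svg>")
    return "\n".join(parts)


def refinement_request(prompt: str, svg: str) -> dict:
    """Stage 3 stand-in: package the sketch for an image model (assumed interface).

    Nothing is sent anywhere; the dict only shows what such a request
    might carry.
    """
    return {
        "prompt": prompt,
        "sketch_svg": svg,
        "instruction": "Add textures, materials, and photorealism; keep the layout.",
    }


prompt = "a small house on a green field"
svg = sketch_svg(conceptualize(prompt))
request = refinement_request(prompt, svg)
print(svg)
```

Because the sketch is plain SVG, the agent (or a human) can move the house by editing a single `x` attribute instead of rewriting the prompt and regenerating, which is the kind of direct canvas control the abstract contrasts with prompt-rewriting loops.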