GenClaw: Code-Driven Agentic Image Generation

Abstract

Image generation models have evolved from text-conditioned pixel synthesis toward multimodal agents with visual comprehension and tool-invocation capabilities. Yet existing agents remain at the mercy of underlying black-box image models: their workflow is trapped in a repetitive cycle of prompt rewriting to refine generations, with no mechanism for manipulating the canvas directly. In essence, the potential of LLMs to serve as a genuine "brush" for precise visual construction remains largely untapped. In this paper, we propose GenClaw, a code-driven agentic image generation paradigm that lets the agent create like a human artist: first conceptualizing, then sketching, and finally coloring. Specifically, the agent first builds conceptual knowledge and context through search and reasoning. It then uses code (e.g., SVG, HTML, Three.js) to render executable visual sketches. Finally, it employs an image generation model to add textures, materials, and photorealism. In this workflow, code serves as a controllable intermediate canvas that bridges linguistic reasoning and pixel synthesis, integrating programmatic logic with the visual expressiveness of generative models. By transforming image generation from a black-box paradigm into a staged process akin to human creation, GenClaw offers a step toward highly controllable and interpretable visual generation systems.
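The staged workflow described above (conceptualize, sketch in code, then color) can be illustrated with a minimal Python sketch. This is not the GenClaw implementation: the `Concept` schema, the function names `conceptualize`, `sketch_svg`, `colorize`, and `generate`, and the `image_model` callable are all hypothetical, and the reasoning stage is replaced by a fixed layout. It only shows how SVG code might act as an inspectable intermediate canvas between a prompt and an image model.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Concept:
    """Stage 1 output (hypothetical schema): what to draw and where."""
    subject: str
    # Each shape is (svg_tag, attributes); coordinates are in canvas pixels.
    shapes: list = field(default_factory=list)


def conceptualize(prompt: str) -> Concept:
    """Stand-in for search and reasoning; returns a fixed layout for illustration."""
    return Concept(
        subject=prompt,
        shapes=[
            ("rect", {"x": "0", "y": "120", "width": "200", "height": "80", "fill": "green"}),
            ("circle", {"cx": "150", "cy": "50", "r": "25", "fill": "gold"}),
        ],
    )


def sketch_svg(concept: Concept, width: int = 200, height: int = 200) -> str:
    """Stage 2: emit the layout as SVG code, the controllable intermediate canvas."""
    root = ET.Element("svg", {
        "xmlns": "http://www.w3.org/2000/svg",
        "width": str(width),
        "height": str(height),
    })
    title = ET.SubElement(root, "title")
    title.text = concept.subject
    for tag, attrs in concept.shapes:
        ET.SubElement(root, tag, attrs)
    return ET.tostring(root, encoding="unicode")


def colorize(svg: str, concept: Concept, image_model=None):
    """Stage 3: pass the sketch to an image model (assumed callable).

    Without a model, return the request that would have been sent.
    """
    request = {"prompt": concept.subject, "structure": svg}
    return image_model(request) if image_model else request


def generate(prompt: str, image_model=None):
    """Run the three stages in order."""
    concept = conceptualize(prompt)
    svg = sketch_svg(concept)
    return colorize(svg, concept, image_model)
```

Calling `generate("a sun over a meadow")` without a model returns the prompt together with the SVG string. Because the sketch is plain code, the position or size of any element can be edited before the final rendering stage, rather than by rewriting the prompt.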