GenClaw: Code-Driven Agentic Image Generation

Abstract

Image generation models have evolved from text-conditioned pixel synthesis toward multimodal agents endowed with visual comprehension and tool invocation capabilities. Yet existing agents remain at the mercy of underlying black-box image models. Their workflow is trapped in a repetitive cycle of prompt rewriting to refine the output, leaving them with no mechanism to manipulate the canvas directly. In essence, the potential of LLMs to serve as a genuine "brush" for precise visual construction remains largely untapped. In this paper, we propose GenClaw, a code-driven agentic image generation paradigm that empowers the agent to create like a human artist: first conceptualizing, then sketching, and finally coloring. Specifically, the agent first constructs conceptual knowledge and context through search and reasoning. It then uses code (e.g., SVG, HTML, Three.js) to render executable visual sketches. Finally, it employs an image generation model to supply textures, materials, and photorealism. In this workflow, code serves as a controllable intermediate canvas that bridges linguistic reasoning and pixel synthesis, integrating programmatic logic with the visual expressiveness of generative models. By transforming image generation from a black-box paradigm into a staged process akin to human creation, GenClaw offers a step toward highly controllable and interpretable visual generation systems.
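
To make the three stages concrete, the following is a minimal sketch, not the GenClaw implementation. It assumes a toy scene schema (`Shape`) and three hypothetical helpers (`conceptualize`, `sketch_svg`, `refinement_request`), none of which come from the paper. The conceptualization stage is stubbed with a fixed layout, and the final stage only packages the sketch and a prompt; no image model is called.

```python
from __future__ import annotations

from dataclasses import dataclass
from xml.etree import ElementTree as ET


@dataclass
class Shape:
    """One primitive in the sketch (hypothetical schema, not from the paper)."""
    kind: str    # SVG element name, e.g. "rect" or "circle"
    attrs: dict  # SVG attributes for that element
    label: str   # what the shape stands for, reused in the prompt


def conceptualize(request: str) -> list[Shape]:
    """Stage 1 stand-in. A real agent would search and reason about the
    request; this stub returns a fixed layout so the example runs."""
    return [
        Shape("rect", {"x": 0, "y": 120, "width": 200, "height": 80,
                       "fill": "#6b8e23"}, "grassy field"),
        Shape("circle", {"cx": 160, "cy": 40, "r": 20,
                         "fill": "#ffd700"}, "sun"),
    ]


def sketch_svg(shapes: list[Shape], width: int = 200, height: int = 200) -> str:
    """Stage 2: render the layout as an SVG document, the code canvas."""
    root = ET.Element("svg", {
        "xmlns": "http://www.w3.org/2000/svg",
        "width": str(width),
        "height": str(height),
        "viewBox": f"0 0 {width} {height}",
    })
    for shape in shapes:
        ET.SubElement(root, shape.kind,
                      {key: str(value) for key, value in shape.attrs.items()})
    return ET.tostring(root, encoding="unicode")


def refinement_request(request: str, svg: str, shapes: list[Shape]) -> dict:
    """Stage 3 stand-in: bundle the sketch with a prompt for an image model.
    Which model consumes this, and how, is left open here."""
    labels = ", ".join(shape.label for shape in shapes)
    return {
        "prompt": f"{request}. Keep the layout of the sketch: {labels}",
        "sketch_svg": svg,
    }


if __name__ == "__main__":
    user_request = "a sunny meadow"
    layout = conceptualize(user_request)
    svg_text = sketch_svg(layout)
    print(svg_text)
    print(refinement_request(user_request, svg_text, layout)["prompt"])
```

Because the sketch is ordinary code, moving the sun is a matter of editing `cx` and `cy` rather than rewriting a prompt and hoping the model complies. That is the kind of direct canvas control the abstract describes.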