GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

Abstract

Current image generation and editing methods primarily process textual prompts as direct inputs, without reasoning about visual composition or explicit operations. We present Generation Chain-of-Thought (GoT), a novel paradigm that performs generation and editing through an explicit language reasoning process before outputting images. This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements. We define the formulation of GoT and construct large-scale GoT datasets containing over 9M samples with detailed reasoning chains that capture semantic-spatial relationships. To leverage the advantages of GoT, we implement a unified framework that integrates Qwen2.5-VL for reasoning chain generation with an end-to-end diffusion model enhanced by our novel Semantic-Spatial Guidance Module. Experiments show that our GoT framework achieves excellent performance on both generation and editing tasks, with significant improvements over baselines. Additionally, our approach enables interactive visual generation, allowing users to explicitly modify reasoning steps for precise image adjustments. GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent. To facilitate future research, we make our datasets, code, and pretrained models publicly available at https://github.com/rongyaofang/GoT.
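To make the idea of an editable reasoning chain more concrete, the following is a minimal sketch and not the authors' implementation. It assumes each reasoning step pairs a short semantic description with a bounding box; the names ReasoningStep, render_chain, and move_object, as well as the (x1, y1, x2, y2) pixel convention, are hypothetical and are not taken from the GoT codebase. It illustrates only how a user might modify one step of a chain before the chain is handed to an image generator.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

# Assumed convention for this sketch: (x1, y1, x2, y2) in pixels.
Box = Tuple[int, int, int, int]


@dataclass(frozen=True)
class ReasoningStep:
    """Hypothetical container: one object description plus its location."""
    description: str
    box: Box


def render_chain(steps: List[ReasoningStep]) -> str:
    """Serialize the steps into a single text string (illustrative format)."""
    return " ".join(f"{s.description} at {s.box}." for s in steps)


def move_object(steps: List[ReasoningStep], index: int, new_box: Box) -> List[ReasoningStep]:
    """Return a new chain in which the step at `index` has a new box.

    The input list is left unchanged, mimicking a user edit to one step.
    """
    updated = list(steps)
    updated[index] = replace(updated[index], box=new_box)
    return updated


chain = [
    ReasoningStep("a red apple", (100, 200, 220, 320)),
    ReasoningStep("a wooden table", (0, 300, 512, 512)),
]

# A user shifts the apple to the right; the table step is untouched.
edited = move_object(chain, 0, (300, 200, 420, 320))
print(render_chain(edited))
```

In the framework described above, a chain of this kind would be produced by the multimodal language model and consumed by the diffusion model; this sketch covers neither stage and only shows the intermediate, user-editable representation under the stated assumptions.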
