Interleaving Reasoning for Better Text-to-Image Generation

Abstract

Unified multimodal understanding and generation models have recently achieved significant improvements in image generation capability, yet a large gap remains in instruction following and detail preservation compared to systems that tightly couple comprehension with generation, such as GPT-4o. Motivated by recent advances in interleaving reasoning, we explore whether such reasoning can further improve Text-to-Image (T2I) generation. We introduce Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis: the model first produces a text-based thinking step to guide an initial image, then reflects on the result to refine fine-grained details, visual quality, and aesthetics while preserving semantics. To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two sub-goals: (1) strengthening the initial think-and-generate stage to establish core content and base quality, and (2) enabling high-quality textual reflection and faithful implementation of those refinements in a subsequent image. We curate IRGL-300K, a dataset organized into six decomposed learning modes that jointly cover text-based thinking and full thinking-image trajectories. Starting from a unified foundation model that natively emits interleaved text-image outputs, our two-stage training first builds robust thinking and reflection, then efficiently tunes the IRG pipeline on the full thinking-image trajectory data. Extensive experiments show state-of-the-art performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN, alongside substantial improvements in visual quality and fine-grained fidelity. The code, model weights, and datasets will be released at https://github.com/Osilly/Interleaving-Reasoning-Generation .
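
The alternation between text-based thinking and image synthesis (think, generate, reflect, refine) can be illustrated as a simple control flow. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the `InterleavedModel` interface, its method names, the `Trajectory` container, and the `EchoModel` stand-in are all hypothetical, and images are represented by placeholder strings so that the example runs without any model weights.

```python
from dataclasses import dataclass, field
from typing import List, Protocol, Tuple


class InterleavedModel(Protocol):
    """Hypothetical interface; the abstract does not specify the model's API."""

    def think(self, prompt: str) -> str: ...
    def generate_image(self, prompt: str, thinking: str) -> str: ...
    def reflect(self, prompt: str, image: str) -> str: ...
    def refine_image(self, prompt: str, image: str, reflection: str) -> str: ...


@dataclass
class Trajectory:
    """Interleaved record of (kind, content) steps, kind is 'text' or 'image'."""

    prompt: str
    steps: List[Tuple[str, str]] = field(default_factory=list)

    @property
    def final_image(self) -> str:
        # The last step of a completed trajectory is the refined image.
        return self.steps[-1][1]


def run_irg(model: InterleavedModel, prompt: str) -> Trajectory:
    """One think -> image -> reflect -> refined-image pass."""
    traj = Trajectory(prompt)

    # Stage 1: text-based thinking guides the initial image.
    thinking = model.think(prompt)
    traj.steps.append(("text", thinking))
    image = model.generate_image(prompt, thinking)
    traj.steps.append(("image", image))

    # Stage 2: reflect on the initial image, then apply the refinements.
    reflection = model.reflect(prompt, image)
    traj.steps.append(("text", reflection))
    refined = model.refine_image(prompt, image, reflection)
    traj.steps.append(("image", refined))
    return traj


class EchoModel:
    """Toy stand-in (assumption): returns strings in place of real images."""

    def think(self, prompt: str) -> str:
        return f"plan for: {prompt}"

    def generate_image(self, prompt: str, thinking: str) -> str:
        return f"<image v1 | {thinking}>"

    def reflect(self, prompt: str, image: str) -> str:
        return f"improve details of {image}"

    def refine_image(self, prompt: str, image: str, reflection: str) -> str:
        return f"<image v2 | {reflection}>"


if __name__ == "__main__":
    trajectory = run_irg(EchoModel(), "a red cube on a blue sphere")
    for kind, content in trajectory.steps:
        print(kind, "->", content)
```

The sketch only captures the ordering of the interleaved outputs. In the described framework a single unified model emits both the text and the images, so the four methods here would correspond to segments of one generated sequence rather than separate calls.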
