Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation

Abstract

Recent advances in visual generation have increasingly explored the integration of reasoning capabilities. Existing approaches incorporate textual reasoning, i.e., thinking, either before the generation process (as pre-planning) or after it (as post-refinement), yet they lack on-the-fly multimodal interaction during the generation itself. In this preliminary study, we introduce Thinking-while-Generating (TwiG), the first interleaved framework that enables textual reasoning to co-evolve with the visual generation process. As visual content is progressively generated, textual reasoning is interleaved to both guide upcoming local regions and reflect on previously synthesized ones. This dynamic interplay produces more context-aware and semantically rich visual outputs. To unveil the potential of this framework, we investigate three candidate strategies: zero-shot prompting, supervised fine-tuning (SFT) on our curated TwiG-50K dataset, and reinforcement learning (RL) via a customized TwiG-GRPO strategy, each offering unique insights into the dynamics of interleaved reasoning. We hope this work inspires further research into interleaving textual reasoning for enhanced visual generation. Code will be released at https://github.com/ZiyuGuo99/Thinking-while-Generating.
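
The interleaving described above (reason about the upcoming region, generate it, then reflect on what has been synthesized so far) can be illustrated with a minimal control-flow sketch. This is not the released TwiG code: the names `InterleavedTrace`, `think_while_generating`, `think`, `generate`, and `reflect` are hypothetical, the three callables are toy stand-ins for a real model, and the abstract does not specify how regions are defined or how a reflection influences later steps.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class InterleavedTrace:
    """Hypothetical record of one interleaved run (not from the paper)."""
    thoughts: List[str] = field(default_factory=list)     # guidance written before each region
    regions: List[str] = field(default_factory=list)      # placeholders for generated visual content
    reflections: List[str] = field(default_factory=list)  # comments written after each region


def think_while_generating(
    prompt: str,
    num_regions: int,
    think: Callable[[str, InterleavedTrace, int], str],
    generate: Callable[[str, str, InterleavedTrace, int], str],
    reflect: Callable[[str, InterleavedTrace, int], str],
) -> InterleavedTrace:
    """Alternate textual reasoning and visual generation, region by region."""
    trace = InterleavedTrace()
    for i in range(num_regions):
        # 1. Textual reasoning guides the upcoming local region.
        thought = think(prompt, trace, i)
        trace.thoughts.append(thought)
        # 2. The region is generated, conditioned on that guidance.
        trace.regions.append(generate(prompt, thought, trace, i))
        # 3. Textual reasoning reflects on everything synthesized so far.
        trace.reflections.append(reflect(prompt, trace, i))
    return trace


# Toy stand-ins (assumptions for illustration only; a real system would call a model).
def toy_think(prompt: str, trace: InterleavedTrace, i: int) -> str:
    return f"plan for region {i} of '{prompt}'"


def toy_generate(prompt: str, thought: str, trace: InterleavedTrace, i: int) -> str:
    return f"<region {i} | {thought}>"


def toy_reflect(prompt: str, trace: InterleavedTrace, i: int) -> str:
    return f"region {i} checked against {len(trace.regions)} region(s) so far"


if __name__ == "__main__":
    result = think_while_generating("a red fox in snow", 3, toy_think, toy_generate, toy_reflect)
    for t, r, c in zip(result.thoughts, result.regions, result.reflections):
        print(t, r, c, sep="\n  ")
```

Passing the reasoning and generation steps in as callables keeps the loop independent of any particular model, so the same skeleton could in principle wrap a zero-shot prompted model or a fine-tuned one.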