Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step

Abstract

Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. However, it remains an open question whether such strategies can be applied to verifying and reinforcing image generation. In this paper, we provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We focus on three techniques: scaling test-time computation for verification, aligning model preferences with Direct Preference Optimization (DPO), and integrating these techniques for complementary effects. Our results demonstrate that these approaches can be effectively adapted and combined to significantly improve image generation performance. Furthermore, given the pivotal role of reward models in our findings, we propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation. PARM adaptively assesses each generation step through a potential assessment approach, merging the strengths of existing reward models, and PARM++ further introduces a reflection mechanism to self-correct unsatisfactory generated images. Using our investigated reasoning strategies, we enhance a baseline model, Show-o, to achieve superior results, with a significant +24% improvement on the GenEval benchmark, surpassing Stable Diffusion 3 by +15%. We hope our study provides unique insights and paves a new path for integrating CoT reasoning with autoregressive image generation. Code and models are released at https://github.com/ZiyuGuo99/Image-Generation-CoT
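
As a rough illustration of two of the techniques named in the abstract, the sketch below shows best-of-N selection (one common way to spend extra test-time computation on verification) and the standard DPO loss for a single preference pair. It is not the released implementation: `best_of_n`, `dpo_loss`, `toy_generate`, and `toy_reward` are hypothetical names, the "images" are plain strings, and the reward is a toy word-overlap score standing in for a learned reward model such as PARM.

```python
import math
import random
from typing import Callable, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str, random.Random], str],
    reward: Callable[[str, str], float],
    n: int = 8,
    seed: int = 0,
) -> Tuple[str, float]:
    """Sample n candidates and keep the one the reward function scores highest."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    scores = [reward(prompt, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair: -log(sigmoid(margin))."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# --- Toy stand-ins (assumptions for illustration only) ---

def toy_generate(prompt: str, rng: random.Random) -> str:
    """Pretend generator: the 'image' depicts a random subset of the prompt words."""
    words = prompt.split()
    k = rng.randint(1, len(words))
    return " ".join(rng.sample(words, k))


def toy_reward(prompt: str, candidate: str) -> float:
    """Pretend reward model: fraction of distinct prompt words present in the candidate."""
    wanted = set(prompt.split())
    return len(wanted & set(candidate.split())) / len(wanted)


if __name__ == "__main__":
    image, score = best_of_n("a red cube", toy_generate, toy_reward, n=16)
    print(image, score)
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

In this sketch the two pieces are independent. One plausible way to combine them, in the spirit of the integration the abstract mentions, is to use reward scores to rank sampled candidates into chosen and rejected pairs for DPO training and then apply best-of-N selection again at inference time; the exact pipeline used in the paper is not specified in this abstract.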