Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
January 23, 2025
Authors: Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Peng Gao, Hongsheng Li, Pheng-Ann Heng
cs.AI
Abstract
Chain-of-Thought (CoT) reasoning has been extensively explored in large
models to tackle complex understanding tasks. However, it remains an open
question whether such strategies can be applied to verifying and reinforcing
image generation scenarios. In this paper, we provide the first comprehensive
investigation of the potential of CoT reasoning to enhance autoregressive image
generation. We focus on three techniques: scaling test-time computation for
verification, aligning model preferences with Direct Preference Optimization
(DPO), and integrating these techniques for complementary effects. Our results
demonstrate that these approaches can be effectively adapted and combined to
significantly improve image generation performance. Furthermore, given the
pivotal role of reward models in our findings, we propose the Potential
Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image
generation. PARM adaptively assesses each generation step through a potential
assessment approach, merging the strengths of existing reward models, and
PARM++ further introduces a reflection mechanism to self-correct the generated
unsatisfactory image. Using our investigated reasoning strategies, we enhance a
baseline model, Show-o, to achieve superior results, with a significant +24%
improvement on the GenEval benchmark, surpassing Stable Diffusion 3 by +15%. We
hope our study provides unique insights and paves a new path for integrating
CoT reasoning with autoregressive image generation. Code and models are
released at https://github.com/ZiyuGuo99/Image-Generation-CoT
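
As a rough illustration of the Direct Preference Optimization objective named in the abstract, the sketch below computes the standard DPO loss for a single preference pair. It is not the authors' implementation: the function name `dpo_loss`, its parameter names, and the default `beta=0.1` are assumptions made for this example, and the inputs are assumed to be summed log-probabilities of a preferred and a rejected image-token sequence under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All arguments are hypothetical names for this sketch. Each log-prob is
    assumed to be the summed token log-likelihood of a generated sequence.
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # sample over the rejected one, relative to the reference model.
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # computed in a numerically stable way for either sign of margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, loss equals log(2).
    print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
    # Policy favors the chosen sample more than the reference does:
    # the loss drops below log(2).
    print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

The loss falls as the policy raises the likelihood of the preferred sample relative to the reference model and lowers that of the rejected one; `beta` controls how strongly deviations from the reference are weighted.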