
T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT

May 1, 2025
作者: Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, Hongsheng Li
cs.AI

Abstract

Recent advancements in large language models have demonstrated how chain-of-thought (CoT) and reinforcement learning (RL) can improve performance. However, applying such reasoning strategies to the visual generation domain remains largely unexplored. In this paper, we present T2I-R1, a novel reasoning-enhanced text-to-image generation model, powered by RL with a bi-level CoT reasoning process. Specifically, we identify two levels of CoT that can be utilized to enhance different stages of generation: (1) the semantic-level CoT for high-level planning of the prompt and (2) the token-level CoT for low-level pixel processing during patch-by-patch generation. To better coordinate these two levels of CoT, we introduce BiCoT-GRPO with an ensemble of generation rewards, which seamlessly optimizes both generation CoTs within the same training step. By applying our reasoning strategies to the baseline model, Janus-Pro, we achieve superior performance with 13% improvement on T2I-CompBench and 19% improvement on the WISE benchmark, even surpassing the state-of-the-art model FLUX.1. Code is available at: https://github.com/CaraJ7/T2I-R1
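The abstract describes BiCoT-GRPO only at a high level: for each prompt, a group of images is sampled, each produced by first writing a semantic-level CoT (a textual plan of the prompt) and then emitting image tokens patch by patch (the token-level CoT); an ensemble of generation rewards scores the resulting images, and a group-relative advantage drives a single policy update over both CoT levels. The sketch below illustrates one such training step under assumed interfaces. The methods generate_text, generate_image_tokens, log_prob, decode_image, and the reward_fns list are hypothetical placeholders, not the actual T2I-R1 API; the real implementation is in the repository linked above.

```python
# Minimal, hypothetical sketch of a BiCoT-GRPO training step.
# Assumes a unified autoregressive model (e.g. Janus-Pro-style) exposing
# text generation, image-token generation, and per-sequence log-probs.
import torch


def bicot_grpo_step(model, tokenizer, prompt, reward_fns, optimizer,
                    group_size=8, clip_eps=0.2):
    """One GRPO update that optimizes semantic- and token-level CoT jointly."""
    sequences, old_logps, images = [], [], []
    for _ in range(group_size):
        # 1) Semantic-level CoT: the model first plans the prompt in text.
        plan_ids = model.generate_text(tokenizer(prompt), max_new_tokens=256)
        # 2) Token-level CoT: conditioned on prompt + plan, it emits image tokens.
        image_ids = model.generate_image_tokens(plan_ids)
        seq = torch.cat([plan_ids, image_ids], dim=-1)
        sequences.append(seq)
        old_logps.append(model.log_prob(seq).detach())   # log-probs under old policy
        images.append(model.decode_image(image_ids))

    # 3) Ensemble of generation rewards (e.g. prompt alignment, detection,
    #    aesthetics), averaged into a single scalar per sampled image.
    rewards = torch.tensor([
        sum(r(prompt, img) for r in reward_fns) / len(reward_fns)
        for img in images
    ])

    # 4) Group-relative advantages: normalize rewards within the sampled group,
    #    so no separate value network is needed.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # 5) Clipped policy-gradient loss applied to the whole sequence, i.e. to
    #    BOTH the textual plan and the image tokens in the same step.
    #    (A KL penalty to a reference model is omitted here for brevity.)
    loss = torch.zeros(())
    for seq, old_lp, a in zip(sequences, old_logps, adv):
        ratio = torch.exp(model.log_prob(seq) - old_lp)
        loss = loss - torch.min(ratio * a,
                                torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * a).mean()
    loss = loss / group_size

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the same group-normalized advantage is propagated through the concatenated sequence, the semantic plan and the image tokens are rewarded (or penalized) together, which is one way the two CoT levels can be coordinated within a single training step.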
