RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning

Abstract

Despite recent progress in text-to-image (T2I) generation, existing models often struggle to faithfully capture user intent from short, under-specified prompts. Prior work has attempted to enhance prompts with large language models (LLMs), but these methods frequently produce stylized or unrealistic content because they are insufficiently grounded in visual semantics and real-world composition. Inspired by recent advances in language model reasoning, we propose RePrompt, a novel reprompting framework that introduces explicit reasoning into the prompt enhancement process via reinforcement learning. Instead of relying on handcrafted rules or stylistic rewrites, our method trains a language model to generate structured, self-reflective prompts by optimizing for image-level outcomes. Tailored reward models assess the generated images in terms of human preference, semantic alignment, and visual composition, providing indirect supervision for prompt generation. Our approach enables end-to-end training without human-annotated data. Experiments on GenEval and T2I-CompBench show that RePrompt significantly improves spatial layout fidelity and compositional generalization across diverse T2I backbones, establishing new state-of-the-art results.
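To make the reward signal described above concrete, the following is a minimal sketch of how three image-level scores could be folded into one scalar reward for reinforcement learning. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the scores are combined, so the weighted mean, the [0, 1] score range, and the names `RewardScores` and `combined_reward` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RewardScores:
    """Scores for one generated image (hypothetical container).

    Assumption: each score has already been normalized to [0, 1] by its
    own reward model; the abstract does not specify any score range.
    """
    human_preference: float
    semantic_alignment: float
    visual_composition: float


def combined_reward(scores: RewardScores,
                    weights: Sequence[float] = (1.0, 1.0, 1.0)) -> float:
    """Return a weighted mean of the three scores.

    Assumption: a weighted mean is only one plausible way to combine the
    signals; the weights here are illustrative, not taken from the paper.
    """
    values = (scores.human_preference,
              scores.semantic_alignment,
              scores.visual_composition)
    if len(weights) != len(values):
        raise ValueError("expected exactly one weight per score")
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("scores are assumed to lie in [0, 1]")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * v for w, v in zip(weights, values)) / total_weight
```

In a training loop of this kind, the enhanced prompt produced by the language model would be rendered by a T2I backbone, the resulting image scored, and the scalar returned by `combined_reward` used as the reward for that prompt. Because the reward is computed on the image rather than on the prompt text, the supervision on the prompt is indirect, which is consistent with the claim that no human-annotated data is needed.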
