PromptRL: Prompt Matters in RL for Flow-Based Image Generation
February 1, 2026
Authors: Fu-Yun Wang, Han Zhang, Michael Gharbi, Hongsheng Li, Taesung Park
cs.AI
Abstract
Flow matching models (FMs) have revolutionized text-to-image (T2I) generation, with reinforcement learning (RL) serving as a critical post-training strategy for alignment with reward objectives. In this research, we show that current RL pipelines for FMs suffer from two underappreciated yet important limitations: sample inefficiency due to insufficient generation diversity, and pronounced prompt overfitting, where models memorize specific training formulations and exhibit dramatic performance collapse when evaluated on semantically equivalent but stylistically varied prompts. We present PromptRL (Prompt Matters in RL for Flow-Based Image Generation), a framework that incorporates language models (LMs) as trainable prompt refinement agents directly within the flow-based RL optimization loop. This design yields two complementary benefits: rapid development of sophisticated prompt rewriting capabilities and, critically, a synergistic training regime that reshapes the optimization dynamics. PromptRL achieves state-of-the-art performance across multiple benchmarks, obtaining scores of 0.97 on GenEval, 0.98 on OCR accuracy, and 24.05 on PickScore.
Furthermore, we validate the effectiveness of our RL approach on large-scale image editing models, improving the EditReward of FLUX.1-Kontext from 1.19 to 1.43 with only 60,000 rollouts, surpassing Gemini 2.5 Flash Image (also known as Nano Banana), which scores 1.37, and matching the performance of ReasonNet (1.44), which relies on fine-grained data annotations and complex multi-stage training. Our extensive experiments empirically demonstrate that PromptRL consistently achieves higher performance ceilings while requiring over 2× fewer rollouts than naive flow-only RL. Our code is available at https://github.com/G-U-N/UniRL.
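The core idea of the abstract, a language model acting as a trainable prompt-refinement agent inside the flow-based RL rollout loop, can be illustrated with a toy sketch. Everything below (the stub `lm_rewrite`, `flow_generate`, and `reward_model` functions, and the group-relative advantage step) is an illustrative stand-in under assumed GRPO-style training, not the paper's actual implementation:

```python
# Toy sketch of a PromptRL-style joint rollout step. All components are
# illustrative stubs, not the paper's code: the real system uses a trainable
# LM rewriter, a flow-matching image generator, and a learned reward model.
import random

random.seed(0)

def lm_rewrite(prompt):
    """Stub for the trainable LM prompt-refinement agent: samples one of
    several stylistic reformulations of the user prompt."""
    variants = [
        prompt,
        f"A detailed photo of {prompt}",
        f"An artistic rendering of {prompt}",
    ]
    return random.choice(variants)

def flow_generate(prompt):
    """Stub for the flow-matching image generator."""
    return f"image[{prompt}]"

def reward_model(image):
    """Stub reward; here, more detailed prompts happen to score higher."""
    return len(image) / 100.0

def promptrl_step(prompt, n_rollouts=4):
    """One rollout group: rewrite, generate, score, then return
    advantage-weighted samples that would update BOTH the LM rewriter
    and the flow model (the synergistic training regime)."""
    samples = []
    for _ in range(n_rollouts):
        rewritten = lm_rewrite(prompt)
        image = flow_generate(rewritten)
        samples.append((rewritten, image, reward_model(image)))
    mean_r = sum(r for _, _, r in samples) / len(samples)
    # Group-relative advantages, shared by both policies.
    return [(p, img, r - mean_r) for p, img, r in samples]

batch = promptrl_step("a red cube on a blue sphere")
for rewritten, _, adv in batch:
    print(f"{adv:+.3f}  {rewritten}")
```

Because the rewriter samples diverse reformulations before each flow rollout, the group of trajectories varies even for a fixed user prompt, which is one plausible reading of how the method addresses the diversity-driven sample inefficiency noted above.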