
Flow-OPD: On-Policy Distillation for Flow Matching Models

May 8, 2026
作者: Zhen Fang, Wenxuan Huang, Yu Zeng, Yiming Zhao, Shuang Chen, Kaituo Feng, Yunlong Lin, Lin Chen, Zehui Chen, Shaosheng Cao, Feng Zhao
cs.AI

Abstract

Existing Flow Matching (FM) text-to-image models suffer from two critical bottlenecks under multi-task alignment: the reward sparsity induced by scalar-valued rewards, and the gradient interference arising from jointly optimizing heterogeneous objectives, which together give rise to a 'seesaw effect' of competing metrics and pervasive reward hacking. Inspired by the success of On-Policy Distillation (OPD) in the large language model community, we propose Flow-OPD, the first unified post-training framework that integrates on-policy distillation into Flow Matching models. Flow-OPD adopts a two-stage alignment strategy: it first cultivates domain-specialized teacher models via single-reward GRPO fine-tuning, allowing each expert to reach its performance ceiling in isolation; it then establishes a robust initial policy through a Flow-based Cold-Start scheme and seamlessly consolidates heterogeneous expertise into a single student via a three-step orchestration of on-policy sampling, task-routing labeling, and dense trajectory-level supervision. We further introduce Manifold Anchor Regularization (MAR), which leverages a task-agnostic teacher to provide full-data supervision that anchors generation to a high-quality manifold, effectively mitigating the aesthetic degradation commonly observed in purely RL-driven alignment. Built upon Stable Diffusion 3.5 Medium, Flow-OPD raises the GenEval score from 63 to 92 and the OCR accuracy from 59 to 94, yielding an overall improvement of roughly 10 points over vanilla GRPO, while preserving image fidelity and human-preference alignment and exhibiting an emergent 'teacher-surpassing' effect. These results establish Flow-OPD as a scalable alignment paradigm for building generalist text-to-image models.
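The two loss components described above — on-policy, dense trajectory-level distillation from a routed expert teacher, plus the MAR anchor term from a task-agnostic teacher — can be sketched in a toy 1-D setting. All names, velocity-field forms, and the single-parameter student below are illustrative assumptions, not the paper's implementation:

```python
# Toy 1-D sketch of Flow-OPD's two loss terms, under illustrative
# assumptions: velocity fields are simple scalar functions, the student
# has one parameter `w`, and Euler integration stands in for the sampler.

def expert_teacher(x, t):
    # Stand-in for a single-reward GRPO-tuned domain expert.
    return (1.0 - x) / max(1.0 - t, 1e-3)

def anchor_teacher(x, t):
    # Stand-in for the task-agnostic teacher used by Manifold Anchor
    # Regularization (MAR) to keep generations near a quality manifold.
    return (0.9 - x) / max(1.0 - t, 1e-3)

def student(x, t, w):
    # Hypothetical one-parameter student; w = 1 matches the expert exactly.
    return w * (1.0 - x) / max(1.0 - t, 1e-3)

def trajectory_loss(w, teacher, n_steps=10, x0=0.0):
    """On-policy, trajectory-level distillation: the student rolls out
    its OWN trajectory, and the teacher scores its velocity densely at
    every step, rather than a single scalar reward at the end."""
    dt = 1.0 / n_steps
    x, loss = x0, 0.0
    for i in range(n_steps):
        t = i * dt
        v = student(x, t, w)
        loss += (v - teacher(x, t)) ** 2
        x += v * dt  # advance with the student's own velocity (on-policy)
    return loss / n_steps

def flow_opd_loss(w, lam=0.1):
    # Distillation from the routed expert plus the MAR anchor term.
    return (trajectory_loss(w, expert_teacher)
            + lam * trajectory_loss(w, anchor_teacher))

# A student matching the expert zeroes the distillation term and pays
# only a small MAR penalty; a mismatched student pays more overall.
assert trajectory_loss(1.0, expert_teacher) < 1e-12
assert flow_opd_loss(1.0) < flow_opd_loss(0.4)
```

The dense per-step supervision is what addresses the reward-sparsity bottleneck the abstract describes: every point on the sampled trajectory receives a learning signal, instead of one scalar reward for the final image.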