Flow-OPD: On-Policy Distillation for Flow Matching Models

Abstract

Existing Flow Matching (FM) text-to-image models face two critical bottlenecks under multi-task alignment: reward sparsity induced by scalar-valued rewards, and gradient interference arising from jointly optimizing heterogeneous objectives. Together these produce a 'seesaw effect' of competing metrics and pervasive reward hacking. Inspired by the success of On-Policy Distillation (OPD) in the large language model community, we propose Flow-OPD, the first unified post-training framework that integrates on-policy distillation into Flow Matching models. Flow-OPD adopts a two-stage alignment strategy. First, it cultivates domain-specialized teacher models via single-reward GRPO fine-tuning, allowing each expert to reach its performance ceiling in isolation. Second, it establishes a robust initial policy through a Flow-based Cold-Start scheme and then consolidates the heterogeneous expertise into a single student via a three-step orchestration of on-policy sampling, task-routing labeling, and dense trajectory-level supervision. We further introduce Manifold Anchor Regularization (MAR), which uses a task-agnostic teacher to provide full-data supervision that anchors generation to a high-quality manifold, effectively mitigating the aesthetic degradation commonly observed in purely RL-driven alignment. Built on Stable Diffusion 3.5 Medium, Flow-OPD raises the GenEval score from 63 to 92 and OCR accuracy from 59 to 94, an overall improvement of roughly 10 points over vanilla GRPO, while preserving image fidelity and human-preference alignment and exhibiting an emergent 'teacher-surpassing' effect. These results establish Flow-OPD as a scalable alignment paradigm for building generalist text-to-image models.
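
The three-step orchestration named in the abstract (on-policy sampling, task-routing labeling, dense trajectory-level supervision) together with the MAR term can be illustrated with a toy sketch. The abstract does not state the form of the loss, so the per-step velocity-matching MSE, the `mar_weight` coefficient, the Euler sampler, and every function name here are assumptions made for illustration only, not the authors' implementation. Velocity fields are plain Python callables on small lists rather than neural networks.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def euler_rollout(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 towards t=1 with Euler steps.

    Returns the list of (x, t, v) triples visited before each step.
    """
    dt = 1.0 / num_steps
    x = list(x0)
    states = []
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        states.append((list(x), t, v))
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return states


def opd_loss(student, teachers, anchor, task, x0, num_steps=4, mar_weight=0.1):
    """Hypothetical on-policy distillation loss for one prompt.

    student, anchor: callables (x, t) -> velocity vector.
    teachers: dict mapping a task label to a specialist velocity callable.
    """
    # Step 1, on-policy sampling: the trajectory comes from the student itself.
    states = euler_rollout(student, x0, num_steps)

    # Step 2, task routing: the task label selects which specialist supervises.
    teacher = teachers[task]

    # Step 3, dense supervision: compare velocities at every visited state,
    # instead of scoring only the final image with a scalar reward.
    total = 0.0
    for x, t, v_student in states:
        distill = mse(v_student, teacher(x, t))
        # Assumed MAR term: pull the student towards a task-agnostic teacher.
        anchor_term = mse(v_student, anchor(x, t))
        total += distill + mar_weight * anchor_term
    return total / len(states)
```

In this sketch the loss is zero when the student already agrees with both the routed teacher and the anchor at every state on its own trajectory. In a real training loop the student would be a network and the same quantity would be minimized by gradient descent, which this toy does not attempt.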