Flow-OPD: On-Policy Distillation for Flow Matching Models

Abstract

Existing Flow Matching (FM) text-to-image models face two critical bottlenecks under multi-task alignment: reward sparsity induced by scalar-valued rewards, and gradient interference arising from jointly optimizing heterogeneous objectives. Together these produce a 'seesaw effect' of competing metrics and pervasive reward hacking. Inspired by the success of On-Policy Distillation (OPD) in the large language model community, we propose Flow-OPD, the first unified post-training framework that integrates on-policy distillation into Flow Matching models. Flow-OPD adopts a two-stage alignment strategy. It first cultivates domain-specialized teacher models via single-reward GRPO fine-tuning, allowing each expert to reach its performance ceiling in isolation. It then establishes a robust initial policy through a Flow-based Cold-Start scheme and consolidates the heterogeneous expertise into a single student via a three-step orchestration of on-policy sampling, task-routing labeling, and dense trajectory-level supervision. We further introduce Manifold Anchor Regularization (MAR), which leverages a task-agnostic teacher to provide full-data supervision that anchors generation to a high-quality manifold, mitigating the aesthetic degradation commonly observed in purely RL-driven alignment. Built upon Stable Diffusion 3.5 Medium, Flow-OPD raises the GenEval score from 63 to 92 and the OCR accuracy from 59 to 94, an overall improvement of roughly 10 points over vanilla GRPO, while preserving image fidelity and human-preference alignment and exhibiting an emergent 'teacher-surpassing' effect. These results establish Flow-OPD as a scalable alignment paradigm for building generalist text-to-image models.
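The three-step orchestration and the MAR term can be illustrated with a toy sketch. The abstract does not give the loss, so everything here is an assumption made for illustration: the mean-squared error between student and teacher velocities, the Euler sampler, the `mar_weight` value, the task keys `"geneval"` and `"ocr"`, and all function names are hypothetical and are not taken from the paper.

```python
# Toy sketch of on-policy distillation for a flow model (assumed form, not the paper's).
# A "model" here is any function (x, t) -> velocity, with x a list of floats.

def make_linear_velocity(target):
    """Hypothetical toy velocity field that pushes x toward `target`."""
    def velocity(x, t):
        return [tg - xi for xi, tg in zip(x, target)]
    return velocity

def sample_student_trajectory(student, x0, num_steps):
    """Step 1, on-policy sampling: Euler-integrate the student's own velocity field."""
    dt = 1.0 / num_steps
    x, states = list(x0), []
    for k in range(num_steps):
        t = k * dt
        states.append((list(x), t))          # record the state the student visited
        v = student(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return states

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)

def opd_loss(student, teachers, anchor, task, x0, num_steps=4, mar_weight=0.1):
    """Distillation loss averaged over every visited state, plus an anchor term."""
    teacher = teachers[task]                 # Step 2, task routing: pick the expert
    states = sample_student_trajectory(student, x0, num_steps)
    distill, anchor_term = 0.0, 0.0
    for x, t in states:                      # Step 3, dense supervision at each step
        v_student = student(x, t)
        distill += mse(v_student, teacher(x, t))
        anchor_term += mse(v_student, anchor(x, t))   # MAR-style regularizer (assumed)
    n = len(states)
    return distill / n + mar_weight * anchor_term / n

# Usage with hypothetical experts and a task-agnostic anchor model.
teachers = {
    "geneval": make_linear_velocity([1.0, 0.0]),
    "ocr": make_linear_velocity([0.0, 1.0]),
}
anchor = make_linear_velocity([0.5, 0.5])
student = make_linear_velocity([0.0, 0.0])
loss = opd_loss(student, teachers, anchor, "geneval", x0=[0.0, 0.0])
```

The point of the sketch is the contrast with a scalar reward: the student receives a target at every state along its own trajectory, and each prompt is supervised only by the teacher for its task, while the anchor term applies to all prompts.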
