UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation

Abstract

Unified models capable of interleaved generation have emerged as a promising paradigm, with the community increasingly converging on autoregressive modeling for text and flow matching for image generation. To advance this direction, we propose a unified reinforcement learning framework tailored for interleaved generation. We validate the approach on its fundamental unit: a single round of reasoning-driven image generation, in which the model first expands the user prompt through reasoning and then synthesizes the image. Formulating this multimodal generation process as a Markov Decision Process with sparse terminal rewards, we introduce UniGRPO, which jointly optimizes the text and image generation policies using GRPO. Adopting a minimalist methodology to avoid over-design, we leverage established training recipes for both modalities by integrating standard GRPO for reasoning and FlowGRPO for visual synthesis. To ensure scalability to multi-round interleaved generation, we introduce two critical modifications to the original FlowGRPO: (1) eliminating classifier-free guidance to maintain linear, unbranched rollouts, which is essential for scaling to complex scenarios involving multi-turn interactions and multi-condition generation (e.g., editing); and (2) replacing the standard latent KL penalty with an MSE penalty applied directly to the velocity fields, providing a more robust and direct regularization signal that effectively mitigates reward hacking. Our experiments demonstrate that this unified training recipe significantly enhances image generation quality through reasoning, providing a robust and scalable baseline for the future post-training of fully interleaved models.
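To make two ingredients of the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the commonly used GRPO formulation in which each rollout's terminal reward is normalized by the mean and standard deviation of its group, and it assumes the velocity penalty is a plain mean squared error between the policy's and a frozen reference model's velocity predictions. The function names, the `eps` constant, and the toy numbers are all hypothetical.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize terminal rewards within one group of rollouts for a prompt.

    Assumption: standard GRPO-style normalization (subtract the group mean,
    divide by the group standard deviation). `eps` avoids division by zero
    when every rollout in the group receives the same reward.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def velocity_mse_penalty(v_policy, v_ref):
    """Mean squared error between policy and reference velocity predictions.

    Assumption: both inputs are flat sequences of floats of equal length,
    standing in for the velocity field at one denoising step.
    """
    if len(v_policy) != len(v_ref):
        raise ValueError("velocity vectors must have the same length")
    return fmean((p - r) ** 2 for p, r in zip(v_policy, v_ref))


# Toy group of four rollouts for one prompt. Each rollout is reasoning text
# followed by an image, scored once at the end (sparse terminal reward).
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)

# Toy velocity vectors: the policy has drifted by 0.1 in every coordinate.
v_ref = [0.0, 1.0, -1.0]
v_policy = [0.1, 1.1, -0.9]
penalty = velocity_mse_penalty(v_policy, v_ref)
```

In this sketch a single advantage per rollout would weight both the text-token terms and the denoising-step terms of the objective. That sharing is an assumption suggested by the sparse terminal reward, and the abstract does not spell out the exact weighting.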
