LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

Abstract

This paper addresses the alignment of flow matching models with human preferences. A promising approach is to fine-tune the model by backpropagating reward gradients directly through the differentiable generation process. However, backpropagating through long trajectories incurs prohibitive memory costs and causes gradient explosion. As a result, direct-gradient methods struggle to update the early generation steps, which are crucial for determining the global structure of the final image. To address this issue, we introduce LeapAlign, a fine-tuning method that reduces computational cost and enables direct gradient propagation from the reward to early generation steps. Specifically, we shorten the long trajectory to only two steps by designing two consecutive leaps, each of which skips multiple ODE sampling steps and predicts a future latent in a single step. By randomizing the start and end timesteps of the leaps, LeapAlign enables efficient and stable model updates at any generation step. To make better use of these shortened trajectories, we assign higher training weights to those that are more consistent with the long generation path. To further improve gradient stability, we reduce the weights of large-magnitude gradient terms instead of removing them entirely, as previous works do. When fine-tuning the Flux model, LeapAlign consistently outperforms state-of-the-art GRPO-based and direct-gradient methods across various metrics, achieving superior image quality and image-text alignment.
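The two-leap idea can be illustrated with a minimal sketch. It is not the authors' implementation. Everything in it is an assumption made for illustration: the toy `velocity` field stands in for the learned model, the time convention runs from t = 1 (noise) to t = 0 (data), each leap is taken to be a single Euler-style jump, and the second leap is assumed to end at the final timestep. The function names are hypothetical.

```python
import random

def velocity(x, t, target=2.0):
    """Toy velocity field, a hypothetical stand-in for the learned model.

    Assumes the linear path x_t = (1 - t) * target + t * noise, for which
    the velocity at (x, t) is (x - target) / t.
    """
    return (x - target) / t

def leap(x, t_from, t_to):
    """One leap: a single Euler-style jump from t_from straight to t_to."""
    return x + (t_to - t_from) * velocity(x, t_from)

def euler_trajectory(x, timesteps):
    """Ordinary multi-step sampling: one small step per adjacent timestep pair."""
    latents = [x]
    for t_from, t_to in zip(timesteps[:-1], timesteps[1:]):
        x = leap(x, t_from, t_to)
        latents.append(x)
    return latents

def two_leap_trajectory(latents, timesteps, rng):
    """Pick random start and middle steps, then reach the end in two leaps."""
    last = len(timesteps) - 1
    i = rng.randrange(0, last - 1)   # start index, leaves room for a middle step
    j = rng.randrange(i + 1, last)   # middle index, strictly between start and end
    x_mid = leap(latents[i], timesteps[i], timesteps[j])
    x_end = leap(x_mid, timesteps[j], timesteps[last])
    return i, j, x_end

rng = random.Random(0)
timesteps = [1 - k / 10 for k in range(11)]   # 1.0, 0.9, ..., 0.0
full = euler_trajectory(-1.5, timesteps)       # long trajectory, 10 steps
i, j, x_end = two_leap_trajectory(full, timesteps, rng)
print(f"start step {i}, middle step {j}, gap to long path: {abs(x_end - full[-1]):.2e}")
```

In this toy field the paths are straight, so the two leaps land on the same endpoint as the long trajectory up to floating-point error. A learned model would generally deviate from the long path. That gap is presumably what the consistency-based training weights described above respond to.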
