Exploring the Design Space of Reward Backpropagation for Flow Matching

Abstract

Aligning text-to-image flow matching models with human preferences via direct reward backpropagation is sample-efficient, but it is hampered by two well-known pathologies. First, activations cannot be stored across the full sampling trajectory at modern model scale. Second, Jacobian products chained across steps inflate the reward gradient as it propagates back to early step indices. Connector-based methods such as LeapAlign address these issues by replacing the full backward trajectory with a short pinned path, which highlights a useful decoupling between sampling and optimization. However, the quality of the resulting gradient depends on how accurately this short path approximates the full rollout, especially over long intervals. We propose FlowBP, a unified surrogate-trajectory framework that treats the backward trajectory itself as the design object. FlowBP keeps a no-gradient cached rollout for sampling, then builds a lightweight backward surrogate from cached and selectively re-forwarded velocities. This view separates four design choices (the reward-model input, the active set, the integration weights, and the bridge coupling) and recovers prior direct-gradient methods as particular settings. Within this framework, we instantiate three variants: FlowBP-Sparse uses sparse Euler reconstruction, FlowBP-Bridge adds controlled bridge coupling, and FlowBP-Lagrange raises the order of the leap quadrature. All three bound memory by the active-set size and limit gradient chaining to at most one Jacobian factor. Across SD3.5-M, FLUX.1-dev, and FLUX.2-Klein-base, evaluated on preference, quality, and compositional metrics, all three variants improve over direct-gradient baselines on most metrics.
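To make the idea of a backward surrogate built from cached and selectively re-forwarded velocities more concrete, the following is a minimal toy sketch and not the paper's implementation. It assumes a scalar velocity field v_theta(x) = theta * x, a quadratic reward, and uniform Euler weights. All function names (`velocity`, `cached_rollout`, `surrogate_grad`, `full_grad`) are hypothetical, and the surrogate shown is only one plausible reading of "sparse Euler reconstruction".

```python
# Toy scalar sketch; NOT the paper's implementation. The velocity field,
# the reward, and every function name below are hypothetical.

def velocity(theta, x):
    """Toy velocity field v_theta(x) = theta * x (assumed for illustration)."""
    return theta * x

def cached_rollout(theta, x0, num_steps):
    """Gradient-free Euler rollout that caches every state and velocity."""
    dt = 1.0 / num_steps
    states, velocities = [x0], []
    for _ in range(num_steps):
        v = velocity(theta, states[-1])
        velocities.append(v)
        states.append(states[-1] + dt * v)
    return states, velocities

def reward_grad(x, target):
    """Derivative of the toy reward r(x) = -(x - target)**2 with respect to x."""
    return -2.0 * (x - target)

def surrogate_grad(theta, x0, num_steps, active, target):
    """Gradient of the reward w.r.t. theta through a sparse Euler surrogate.

    Surrogate endpoint: x0 + dt * sum_k u_k, where u_k is re-forwarded at the
    cached state x_k (held constant) for k in `active`, and is the cached
    constant velocity otherwise. For this toy field d u_k / d theta = x_k,
    so no step-to-step Jacobian product appears in the gradient.
    """
    states, _ = cached_rollout(theta, x0, num_steps)
    dt = 1.0 / num_steps
    dx_dtheta = sum(dt * states[k] for k in active)
    return reward_grad(states[-1], target) * dx_dtheta

def full_grad(theta, x0, num_steps, target):
    """Exact gradient through the whole Euler chain, for comparison."""
    dt = 1.0 / num_steps
    x_final = x0 * (1.0 + dt * theta) ** num_steps
    dx_dtheta = num_steps * dt * x0 * (1.0 + dt * theta) ** (num_steps - 1)
    return reward_grad(x_final, target) * dx_dtheta
```

In this toy setting, calling `surrogate_grad(0.5, 1.0, 4, [0, 3], 2.0)` differentiates through only two re-forwarded steps, so the cost of the backward pass scales with the size of the active set rather than with the rollout length. Comparing the result with `full_grad` shows how much of the chained-Jacobian contribution the surrogate discards in this simplified case.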