Stitched Value Models for Diffusion Alignment

Abstract

For practical use, diffusion- or flow-based generative models must be aligned with task-specific rewards, such as prompt fidelity or aesthetic preference. This alignment is challenging because the reward is defined on clean output images, whereas the alignment procedure requires value function estimates at noisy intermediate latents. Existing methods resort to Tweedie-style or Monte Carlo approximations, trading off estimator bias against computational cost: Tweedie estimates are efficient but biased, while Monte Carlo estimates are more accurate but require expensive rollouts. A natural alternative is a learned value function, but how to effectively train a strong and general value model specifically for noisy latents remains an open question. Here, we propose StitchVM, a model stitching framework that efficiently transfers reward models pretrained on clean images to the noisy latent regime. StitchVM starts from an existing, truncated pixel-space reward model and attaches a frozen diffusion backbone to it as its head. From the pixel-space model, the resulting hybrid retains a carefully pretrained, robust reward capability; from the diffusion backbone, it inherits the native ability to handle noisy latents. The stitching procedure is exceptionally lightweight: for example, stitching and finetuning CLIP ViT-L and SD 3.5 Medium takes only 10 GPU-hours. By lifting powerful pixel-space reward models to latent space, StitchVM opens up a new style of diffusion alignment: instead of a rough yet costly per-sample approximation of the value function, the correct function for the actual noisy latents is constructed once and then amortized over many samples and iterations. We show that this approach yields improvements across a broad range of downstream steering and post-training methods: DPS becomes 3.2× faster while halving peak GPU memory, and DiffusionNFT becomes 2.3× faster.
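The bias-versus-cost trade-off between Tweedie-style and Monte Carlo value estimates can be illustrated with a toy example. The sketch below is not part of StitchVM; it assumes a one-dimensional setting chosen purely for illustration, with a standard normal prior on the clean sample, additive Gaussian noise of scale `sigma`, and a hypothetical reward `r(x0) = x0**2`. Under these assumptions the posterior over the clean sample is available in closed form, so the exact value can be compared with both estimators. In a real diffusion model, drawing posterior samples would require expensive rollouts rather than a single Gaussian draw.

```python
import random


def posterior_params(x_t, sigma):
    """Mean and variance of p(x0 | x_t), assuming x0 ~ N(0, 1) and
    x_t = x0 + sigma * eps with eps ~ N(0, 1) (toy assumption)."""
    mean = x_t / (1.0 + sigma ** 2)
    var = sigma ** 2 / (1.0 + sigma ** 2)
    return mean, var


def reward(x0):
    """Hypothetical reward defined on the clean sample only."""
    return x0 ** 2


def exact_value(x_t, sigma):
    """Exact value E[r(x0) | x_t] = mean^2 + var for the quadratic reward."""
    mean, var = posterior_params(x_t, sigma)
    return mean ** 2 + var


def tweedie_value(x_t, sigma):
    """Tweedie-style estimate: evaluate the reward at the posterior mean.
    Cheap, but it ignores the posterior variance and is therefore biased."""
    mean, _ = posterior_params(x_t, sigma)
    return reward(mean)


def monte_carlo_value(x_t, sigma, n_samples, rng):
    """Monte Carlo estimate: average the reward over posterior samples.
    Unbiased, but the cost grows linearly with n_samples."""
    mean, var = posterior_params(x_t, sigma)
    std = var ** 0.5
    total = sum(reward(rng.gauss(mean, std)) for _ in range(n_samples))
    return total / n_samples


if __name__ == "__main__":
    rng = random.Random(0)
    x_t, sigma = 1.0, 1.0
    print("exact      :", exact_value(x_t, sigma))
    print("tweedie    :", tweedie_value(x_t, sigma))
    print("monte carlo:", monte_carlo_value(x_t, sigma, 20000, rng))
```

In this toy case the Tweedie-style estimate misses the exact value by the posterior variance, while the Monte Carlo estimate converges to it only as the number of samples grows. A learned value function, as motivated above, would aim to return the exact quantity in a single forward pass.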