Stitched Value Models for Diffusion Alignment

Abstract

For practical use, diffusion- or flow-based generative models must be aligned with task-specific rewards, such as prompt fidelity or aesthetic preference. This alignment is challenging because the reward is defined on clean output images, whereas the alignment procedure requires value function estimates at noisy intermediate latents. Existing methods resort to Tweedie-style or Monte Carlo approximations, trading estimator bias against computational cost: Tweedie estimates are efficient but biased, while Monte Carlo estimates are more accurate but require expensive rollouts. A natural alternative is a learned value function, but how to effectively train a strong and general value model specifically for noisy latents remains an open question. Here, we propose StitchVM, a model stitching framework that efficiently transfers reward models pretrained on clean images to the noisy latent regime. StitchVM starts from an existing, truncated pixel-space reward model and attaches a frozen diffusion backbone to it as its head. From the pixel-space model, the resulting hybrid retains a carefully pretrained, robust reward capability; from the diffusion backbone, it inherits the native ability to handle noisy latents. The stitching procedure is exceptionally lightweight: for example, stitching and finetuning CLIP ViT-L and SD 3.5 Medium takes only 10 GPU-hours. By lifting powerful pixel-space reward models to latent space, StitchVM opens up a new style of diffusion alignment: instead of a rough yet costly per-sample approximation of the value function, the correct function for the actual noisy latents is constructed once and then amortized over many samples and iterations. We show that this approach yields improvements across a broad range of downstream steering and post-training methods: DPS becomes 3.2× faster while halving peak GPU memory, and DiffusionNFT becomes 2.3× faster.
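The bias-versus-cost trade-off between Tweedie-style and Monte Carlo estimates can be illustrated on a toy problem. The sketch below is not the StitchVM method; it assumes a one-dimensional Gaussian prior, a variance-exploding noising rule x_t = x_0 + sigma * eps, a quadratic reward, and a value function defined as the posterior expected reward E[r(x_0) | x_t]. All function names are hypothetical. In this setting the posterior is available in closed form, so the Monte Carlo estimator can sample from it directly, standing in for the sampler rollouts a real diffusion model would need.

```python
import random
from typing import Tuple


def posterior_params(x_t: float, sigma: float) -> Tuple[float, float]:
    """Mean and variance of x0 | x_t, assuming x0 ~ N(0, 1) and x_t = x0 + sigma * eps."""
    mean = x_t / (1.0 + sigma ** 2)
    var = sigma ** 2 / (1.0 + sigma ** 2)
    return mean, var


def reward(x0: float) -> float:
    """Toy nonlinear reward on the clean sample (an assumption for illustration)."""
    return x0 * x0


def exact_value(x_t: float, sigma: float) -> float:
    """Closed-form E[r(x0) | x_t] for the quadratic reward: mean^2 + variance."""
    mean, var = posterior_params(x_t, sigma)
    return mean ** 2 + var


def tweedie_value(x_t: float, sigma: float) -> float:
    """Tweedie-style estimate: evaluate the reward at the posterior mean only."""
    mean, _ = posterior_params(x_t, sigma)
    return reward(mean)


def monte_carlo_value(x_t: float, sigma: float, n_samples: int,
                      rng: random.Random) -> float:
    """Monte Carlo estimate: average the reward over posterior samples.

    Each sample stands in for one rollout, so cost grows with n_samples.
    """
    mean, var = posterior_params(x_t, sigma)
    std = var ** 0.5
    total = sum(reward(rng.gauss(mean, std)) for _ in range(n_samples))
    return total / n_samples


if __name__ == "__main__":
    rng = random.Random(0)
    x_t, sigma = 1.0, 1.0
    print("exact      :", exact_value(x_t, sigma))    # 0.75
    print("Tweedie    :", tweedie_value(x_t, sigma))  # 0.25
    print("Monte Carlo:", monte_carlo_value(x_t, sigma, 20_000, rng))
```

Here the Tweedie-style estimate misses the posterior-variance term entirely (0.25 versus 0.75), while the Monte Carlo estimate approaches the exact value only by paying for many samples per latent. A learned value model aims to return the right-hand quantity in a single forward pass.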