Stitched Value Models for Diffusion Alignment

Abstract

For practical use, diffusion- or flow-based generative models must be aligned with task-specific rewards, such as prompt fidelity or aesthetic preference. This alignment is challenging because the reward is defined on clean output images, whereas the alignment procedure requires value function estimates at noisy intermediate latents. Existing methods resort to Tweedie-style or Monte Carlo approximations, trading off estimator bias against computational cost: Tweedie estimates are efficient but biased, while Monte Carlo estimates are more accurate but require expensive rollouts. A natural alternative would be a learned value function, but it remains an open question how to effectively train a strong and general value model, particularly for noisy latents. Here, we propose StitchVM, a model stitching framework that efficiently transfers reward models pretrained on clean images to the noisy latent regime. StitchVM starts from an existing, truncated pixel-space reward model and attaches a frozen diffusion backbone to it as its head. From the pixel-space model, the resulting hybrid retains a carefully pretrained, robust reward capability; from the diffusion backbone, it inherits the native ability to handle noisy latents. The stitching procedure is exceptionally lightweight: for example, stitching and finetuning CLIP ViT-L and SD 3.5 Medium takes only 10 GPU-hours. By lifting powerful pixel-space reward models to latent space, StitchVM opens up a new style of diffusion alignment: instead of relying on a rough yet costly per-sample approximation of the value function, the correct function for the actual noisy latents is constructed once and then amortized over many samples and iterations. We show that this approach yields improvements across a broad range of downstream steering and post-training methods: DPS becomes 3.2× faster while halving peak GPU memory, and DiffusionNFT becomes 2.3× faster.
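The bias-versus-cost trade-off between Tweedie-style and Monte Carlo value estimates can be illustrated with a toy example. The sketch below is not the paper's method or code. It assumes a one-dimensional setting with a standard Gaussian prior on the clean sample, a noisy latent of the form x_t = x_0 + sigma * eps, a quadratic reward, and a value defined as the conditional expected reward E[r(x_0) | x_t]. All function names are hypothetical. In this toy case the posterior is available in closed form, so the "rollout" is a single Gaussian draw; in a real diffusion model each Monte Carlo sample would require running the sampler.

```python
import random


def posterior(x_t, sigma):
    """Posterior mean and variance of x0 given x_t = x0 + sigma * eps, x0 ~ N(0, 1)."""
    var = sigma**2 / (1.0 + sigma**2)
    mean = x_t / (1.0 + sigma**2)
    return mean, var


def reward(x0):
    # Toy reward defined on clean samples only (assumption for illustration).
    return x0 * x0


def value_exact(x_t, sigma):
    # E[x0^2 | x_t] = mean^2 + var for a Gaussian posterior.
    mean, var = posterior(x_t, sigma)
    return mean**2 + var


def value_tweedie(x_t, sigma):
    # Tweedie-style shortcut: evaluate the reward at the posterior mean.
    # Cheap (one reward call) but biased whenever the reward is nonlinear.
    mean, _ = posterior(x_t, sigma)
    return reward(mean)


def value_monte_carlo(x_t, sigma, n_samples=20000, seed=0):
    # Monte Carlo estimate: average the reward over posterior samples.
    # Unbiased, but needs n_samples reward calls (and rollouts in practice).
    rng = random.Random(seed)
    mean, var = posterior(x_t, sigma)
    std = var**0.5
    total = sum(reward(rng.gauss(mean, std)) for _ in range(n_samples))
    return total / n_samples


if __name__ == "__main__":
    x_t, sigma = 1.0, 1.0
    print("exact      :", value_exact(x_t, sigma))
    print("tweedie    :", value_tweedie(x_t, sigma))
    print("monte carlo:", value_monte_carlo(x_t, sigma))
```

In this toy setting the Tweedie-style estimate misses exactly the posterior variance term, while the Monte Carlo estimate approaches the exact value as the sample count grows. A learned value model aims to return the exact-style answer with a single forward pass per latent.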