Ψ-Sampler: Initial Particle Sampling for SMC-Based Inference-Time Reward Alignment in Score Models

June 2, 2025
Authors: Taehoon Yoon, Yunhong Min, Kyeongmin Yeo, Minhyuk Sung
cs.AI

Abstract

We introduce Psi-Sampler, an SMC-based framework incorporating pCNL-based initial particle sampling for effective inference-time reward alignment with a score-based generative model. Inference-time reward alignment with score-based generative models has recently gained significant traction, following a broader paradigm shift from pre-training to post-training optimization. At the core of this trend is the application of Sequential Monte Carlo (SMC) to the denoising process. However, existing methods typically initialize particles from the Gaussian prior, which inadequately captures reward-relevant regions and results in reduced sampling efficiency. We demonstrate that initializing from the reward-aware posterior significantly improves alignment performance. To enable posterior sampling in high-dimensional latent spaces, we introduce the preconditioned Crank-Nicolson Langevin (pCNL) algorithm, which combines dimension-robust proposals with gradient-informed dynamics. This approach enables efficient and scalable posterior sampling and consistently improves performance across various reward alignment tasks, including layout-to-image generation, quantity-aware generation, and aesthetic-preference generation, as demonstrated in our experiments.
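To make the pCNL step concrete, below is a minimal NumPy sketch of a single pCNL transition targeting a reward-tilted posterior pi(x) ∝ exp(-Phi(x)) N(x; 0, I), i.e., a standard Gaussian latent prior as is typical for score models. The proposal and acceptance rule follow the preconditioned Crank-Nicolson Langevin algorithm of Cotter et al. (2013) with identity covariance; the potential `phi` (with gradient `grad_phi`) and the step size `delta` are illustrative placeholders, not the exact reward potential or hyperparameters used by Psi-Sampler.

```python
import numpy as np

def pcnl_step(x, phi, grad_phi, delta=0.1, rng=None):
    """One pCNL transition targeting pi(x) ∝ exp(-phi(x)) * N(x; 0, I).

    Proposal (Crank-Nicolson discretization of preconditioned Langevin
    dynamics, prior covariance C = I):
        (2 + delta) * v = (2 - delta) * x - 2 * delta * grad_phi(x)
                          + sqrt(8 * delta) * xi,   xi ~ N(0, I)
    accepted via the Metropolis-Hastings rule of Cotter et al. (2013).
    """
    rng = np.random.default_rng() if rng is None else rng
    g_x = grad_phi(x)
    xi = rng.standard_normal(x.shape)
    v = ((2.0 - delta) * x - 2.0 * delta * g_x
         + np.sqrt(8.0 * delta) * xi) / (2.0 + delta)

    def rho(u, w, g_u):
        # rho(u, w) = Phi(u) + 1/2 <w - u, grad Phi(u)>
        #           + delta/4 <w + u, grad Phi(u)> + delta/4 ||grad Phi(u)||^2
        return (phi(u)
                + 0.5 * np.dot(g_u.ravel(), (w - u).ravel())
                + 0.25 * delta * np.dot(g_u.ravel(), (w + u).ravel())
                + 0.25 * delta * np.dot(g_u.ravel(), g_u.ravel()))

    # Accept with probability min(1, exp(rho(x, v) - rho(v, x))).
    log_alpha = rho(x, v, g_x) - rho(v, x, grad_phi(v))
    if np.log(rng.uniform()) < log_alpha:
        return v, True
    return x, False

# Toy usage: phi(x) = 0.5 * ||x - mu||^2 yields the Gaussian posterior
# N(mu / 2, I / 2), so the chain's mean should approach mu / 2.
mu = np.array([1.5, -0.5])
phi = lambda x: 0.5 * np.sum((x - mu) ** 2)
grad_phi = lambda x: x - mu
x = np.zeros(2)
samples = []
for _ in range(5000):
    x, _ = pcnl_step(x, phi, grad_phi, delta=0.2)
    samples.append(x)
print(np.mean(samples[1000:], axis=0))  # ≈ [0.75, -0.25]
```

Note the design choice this illustrates: with the gradient term removed, the proposal reduces to plain pCN, which leaves the Gaussian prior N(0, I) exactly invariant, so the acceptance rate does not collapse as the latent dimension grows; the Langevin drift then steers proposals toward high-reward regions. This is the combination the abstract describes as dimension-robust proposals with gradient-informed dynamics.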