Ψ-Sampler: Initial Particle Sampling for SMC-Based Inference-Time Reward Alignment in Score Models

Abstract

We introduce Psi-Sampler, an SMC-based framework that incorporates pCNL-based initial particle sampling for effective inference-time reward alignment with a score-based generative model. Inference-time reward alignment with score-based generative models has recently gained significant traction, following a broader paradigm shift from pre-training to post-training optimization. At the core of this trend is the application of Sequential Monte Carlo (SMC) to the denoising process. However, existing methods typically initialize particles from the Gaussian prior, which inadequately captures reward-relevant regions and reduces sampling efficiency. We demonstrate that initializing from the reward-aware posterior significantly improves alignment performance. To enable posterior sampling in high-dimensional latent spaces, we introduce the preconditioned Crank-Nicolson Langevin (pCNL) algorithm, which combines dimension-robust proposals with gradient-informed dynamics. This approach enables efficient and scalable posterior sampling and, in our experiments, consistently improves performance across reward alignment tasks including layout-to-image generation, quantity-aware generation, and aesthetic-preference generation.
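
As a rough illustration of the pCNL idea named in the abstract, the sketch below implements one textbook form of a preconditioned Crank-Nicolson Langevin step in plain Python. It is not the authors' implementation. It assumes a standard Gaussian prior N(0, I) and a target proportional to exp(-phi(x)) N(x; 0, I), where `phi` would play the role of a negative, scaled reward; that mapping, the toy quadratic `toy_phi`, the step size `delta`, and all function names are assumptions made for the example. The accept/reject test uses the ordinary finite-dimensional Metropolis-Hastings ratio.

```python
import math
import random


def pcnl_step(x, phi, grad_phi, delta, rng):
    """One pCNL step for a target proportional to exp(-phi(x)) * N(x; 0, I).

    Assumption: identity prior covariance. Returns (new_state, accepted).
    """
    rho = (2.0 - delta) / (2.0 + delta)  # Crank-Nicolson contraction factor
    var = 1.0 - rho * rho                # per-coordinate proposal variance

    def proposal_mean(u):
        # Shrink toward the prior mean (0) and step down the gradient of phi.
        g = grad_phi(u)
        return [rho * ui - (1.0 - rho) * gi for ui, gi in zip(u, g)]

    def log_target(u):
        # Unnormalized log density: -phi(u) plus the standard Gaussian prior.
        return -phi(u) - 0.5 * sum(ui * ui for ui in u)

    def log_q(v, u):
        # log N(v; proposal_mean(u), var * I), dropping constants that cancel.
        m = proposal_mean(u)
        return -sum((vi - mi) ** 2 for vi, mi in zip(v, m)) / (2.0 * var)

    std = math.sqrt(var)
    y = [mi + std * rng.gauss(0.0, 1.0) for mi in proposal_mean(x)]

    # Metropolis-Hastings correction keeps the target distribution invariant.
    log_alpha = log_target(y) + log_q(x, y) - log_target(x) - log_q(y, x)
    if rng.random() < math.exp(min(0.0, log_alpha)):
        return y, True
    return x, False


def run_chain(x0, phi, grad_phi, delta, n_steps, burn_in, seed=0):
    """Run a pCNL chain; return post-burn-in states and the acceptance rate."""
    rng = random.Random(seed)
    x = list(x0)
    kept, n_accept = [], 0
    for step in range(n_steps):
        x, accepted = pcnl_step(x, phi, grad_phi, delta, rng)
        n_accept += accepted
        if step >= burn_in:
            kept.append(x)
    return kept, n_accept / n_steps


# Toy potential (hypothetical): phi(x) = 0.5 * ||x - A||^2.
A = [1.0, -2.0, 0.5, 3.0]


def toy_phi(x):
    return 0.5 * sum((xi - ai) ** 2 for xi, ai in zip(x, A))


def toy_grad_phi(x):
    return [xi - ai for xi, ai in zip(x, A)]


if __name__ == "__main__":
    samples, acc_rate = run_chain([0.0] * len(A), toy_phi, toy_grad_phi,
                                  delta=0.5, n_steps=20000, burn_in=2000)
    means = [sum(s[i] for s in samples) / len(samples) for i in range(len(A))]
    print("acceptance rate:", acc_rate)
    print("sample means:", means)
```

For this toy potential the target is Gaussian with mean A/2 and variance 1/2 per coordinate, so the sample means give a quick sanity check. In an SMC setting such as the one the abstract describes, states drawn from a chain like this would serve as initial particles in place of draws from the Gaussian prior.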
