Self-Rewarding Sequential Monte Carlo for Masked Diffusion Language Models
February 2, 2026
Authors: Ziwei Luo, Ziqi Jin, Lei Wang, Lidong Bing, Thomas B. Schön
cs.AI
Abstract
This work presents self-rewarding sequential Monte Carlo (SMC), an inference-time scaling algorithm that enables effective sampling from masked diffusion language models (MDLMs). Our algorithm stems from the observation that most existing MDLMs rely on a confidence-based sampling strategy, where only the tokens with the highest prediction confidence are preserved at each step. This restricts generation to a noise-sensitive, greedy decoding paradigm and leads to an inevitable collapse in the diversity of possible paths. We address this problem by launching multiple interacting diffusion processes in parallel, referred to as particles, for trajectory exploration. Importantly, we introduce trajectory-level confidence as a self-rewarding signal for assigning particle importance weights. During sampling, particles are iteratively weighted and resampled to systematically steer generation towards globally confident, high-quality samples. Our self-rewarding SMC is verified on various masked diffusion language models and benchmarks, achieving significant improvements without extra training or reward guidance, while effectively converting parallel inference capacity into improved sampling quality. Our code is available at https://github.com/Algolzw/self-rewarding-smc.
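To make the procedure described above concrete, below is a minimal sketch of a self-rewarding SMC loop: particles unmask tokens in parallel, each step's revealed-token confidence serves as the self-reward for importance weighting, and particles are then resampled so that globally confident trajectories survive. Everything here (the placeholder `denoise_step`, the constants, and the per-step reward definition) is an illustrative assumption, not the authors' implementation; see the linked repository for the real code.

```python
# Minimal sketch of self-rewarding SMC for a masked diffusion LM (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
MASK, VOCAB, SEQ_LEN, STEPS, PARTICLES = -1, 32, 16, 8, 4


def denoise_step(x):
    """Placeholder denoiser: returns proposed tokens and per-token confidences.

    A real MDLM would run its transformer on the partially masked sequence x
    and return token predictions with their softmax probabilities.
    """
    tokens = rng.integers(0, VOCAB, size=x.shape)
    conf = rng.uniform(0.5, 1.0, size=x.shape)
    return tokens, conf


def smc_sample():
    # Start every particle as a fully masked sequence.
    particles = np.full((PARTICLES, SEQ_LEN), MASK)
    tokens_per_step = SEQ_LEN // STEPS

    for step in range(STEPS):
        log_w = np.zeros(PARTICLES)  # incremental importance weights for this step
        for i in range(PARTICLES):
            tokens, conf = denoise_step(particles[i])
            masked = np.where(particles[i] == MASK)[0]
            # Confidence-based unmasking within each particle (locally greedy).
            reveal = masked[np.argsort(-conf[masked])[:tokens_per_step]]
            particles[i, reveal] = tokens[reveal]
            # Self-reward: log-confidence of the newly revealed tokens.
            log_w[i] = np.sum(np.log(conf[reveal]))

        # Normalize weights and resample; the trajectory-level reward is
        # accumulated implicitly through repeated reweighting and resampling,
        # which keeps globally confident trajectories alive.
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        idx = rng.choice(PARTICLES, size=PARTICLES, p=w)
        particles = particles[idx].copy()

    return particles[0]  # all surviving particles carry equal weight after resampling


print(smc_sample())
```

With a real MDLM plugged into `denoise_step`, the extra particles turn parallel compute into sampling quality: weak trajectories are pruned at each resampling step instead of being carried to the end, without any additional training or external reward model.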