Self-Rewarding Sequential Monte Carlo for Masked Diffusion Language Models

February 2, 2026
Authors: Ziwei Luo, Ziqi Jin, Lei Wang, Lidong Bing, Thomas B. Schön
cs.AI

Abstract

This work presents self-rewarding sequential Monte Carlo (SMC), an inference-time scaling algorithm enabling effective sampling of masked diffusion language models (MDLMs). Our algorithm stems from the observation that most existing MDLMs rely on a confidence-based sampling strategy, where only tokens with the highest prediction confidence are preserved at each step. This restricts the generation to a noise-sensitive, greedy decoding paradigm, resulting in an inevitable collapse in the diversity of possible paths. We address this problem by launching multiple interacting diffusion processes in parallel, referred to as particles, for trajectory exploration. Importantly, we introduce the trajectory-level confidence as a self-rewarding signal for assigning particle importance weights. During sampling, particles are iteratively weighted and resampled to systematically steer generation towards globally confident, high-quality samples. Our self-rewarding SMC is verified on various masked diffusion language models and benchmarks, achieving significant improvement without extra training or reward guidance, while effectively converting parallel inference capacity into improved sampling quality. Our code is available at https://github.com/Algolzw/self-rewarding-smc.
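
To make the procedure concrete, here is a minimal, self-contained sketch of the particle loop the abstract describes: several partially masked sequences evolve in parallel, each commits its highest-confidence tokens at every step, the accumulated trajectory-level confidence acts as a self-rewarding signal for importance weights, and particles are resampled accordingly. Everything below is an illustrative assumption rather than the authors' implementation: `toy_denoiser`, the vocabulary/length constants, and the multinomial resampling scheme are placeholders (see the linked repository for the actual code).

```python
# Hedged sketch of self-rewarding SMC for masked diffusion sampling.
# Assumptions: `toy_denoiser` stands in for a real MDLM forward pass, and
# plain multinomial resampling is used; the paper's exact scheme may differ.
import numpy as np

VOCAB, LENGTH, MASK = 16, 12, -1   # toy vocabulary size, sequence length, mask id
rng = np.random.default_rng(0)

def toy_denoiser(seq):
    """Hypothetical stand-in for an MDLM: returns a (LENGTH, VOCAB) matrix of
    per-position token probabilities for a partially unmasked sequence."""
    logits = rng.normal(size=(LENGTH, VOCAB)) + 2.0 * (seq[:, None] == np.arange(VOCAB))
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return probs / probs.sum(axis=-1, keepdims=True)

def smc_sample(num_particles=8, steps=6):
    # Each particle is a fully masked sequence plus a running trajectory log-confidence.
    seqs = np.full((num_particles, LENGTH), MASK)
    traj_logconf = np.zeros(num_particles)
    unmask_per_step = LENGTH // steps

    for _ in range(steps):
        for i in range(num_particles):
            probs = toy_denoiser(seqs[i])
            conf = probs.max(axis=-1)              # per-token prediction confidence
            conf[seqs[i] != MASK] = -np.inf        # only still-masked positions compete
            picks = np.argsort(conf)[-unmask_per_step:]    # confidence-based unmasking
            seqs[i, picks] = probs[picks].argmax(axis=-1)  # commit the chosen tokens
            traj_logconf[i] += np.log(conf[picks]).sum()   # self-reward: trajectory confidence

        # Importance weights from the self-rewarding signal, then multinomial resampling.
        weights = np.exp(traj_logconf - traj_logconf.max())
        weights /= weights.sum()
        idx = rng.choice(num_particles, size=num_particles, p=weights)
        seqs, traj_logconf = seqs[idx].copy(), traj_logconf[idx].copy()

    return seqs[np.argmax(traj_logconf)]  # most globally confident trajectory

print(smc_sample())
```

In a real setting, `toy_denoiser` would be the model's forward pass over the partially masked sequence, and the trajectory confidence would accumulate the model's actual token probabilities; refinements such as systematic resampling or resampling only when the effective sample size drops are standard SMC variants that could slot into the same loop.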