Efficient Parallel Samplers for Recurrent-Depth Models and Their Connection to Diffusion Language Models
October 16, 2025
Authors: Jonas Geiping, Xinyu Yang, Guinan Su
cs.AI
Abstract
Language models with recurrent depth, also referred to as universal or looped
transformers, are defined by their capacity to increase computation through the
repetition of layers. Recent efforts in pretraining
have demonstrated that these architectures can scale to modern language
modeling tasks while exhibiting advantages in reasoning tasks. In this work, we
examine the relationship between recurrent-depth models and diffusion language
models. Building on their similarities, we develop a new diffusion forcing
sampler for these models to accelerate generation. The sampler advances by
decoding new tokens at every forward pass of the model, while the latent states
of these tokens can be further refined in parallel through recurrence.
Theoretically, generation with our sampler is strictly more expressive than the
baseline autoregressive generation using the same time budget on modern
hardware. Moreover, this sampler, based on principles from the diffusion
literature, can be applied directly to an existing 3.5B-parameter recurrent-depth
transformer without any tuning, yielding a speedup of up to 5x. Consequently,
our findings not only provide an efficient mechanism for parallelizing the
extra computation in recurrent-depth models at inference, but also suggest that
such models can be naturally viewed as strong continuous, though causal,
diffusion language models.
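
The abstract describes the sampler only at a high level. As a reading aid, below is a minimal, self-contained sketch of the diffusion-forcing idea it alludes to: every forward pass refines all in-progress latent states by one recurrence step and drafts a token at a newly opened position, instead of finishing one position before starting the next. The ToyRecurrentDepthLM class, the core_step interface, the fixed refinement budget num_steps, and the greedy argmax decoding are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn


class ToyRecurrentDepthLM(nn.Module):
    # Stand-in for a recurrent-depth model: one shared core applied repeatedly.
    # A real model would use a causal transformer block as the core; the GRU
    # cell here only keeps the sketch runnable.
    def __init__(self, vocab_size=256, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.core = nn.GRUCell(d_model, d_model)      # shared recurrent core
        self.readout = nn.Linear(d_model, vocab_size)

    def core_step(self, latent, token_emb):
        # One recurrence iteration: refine the latent state of one position.
        return self.core(token_emb, latent)


@torch.no_grad()
def diffusion_forcing_sample(model, prompt_ids, max_new_tokens=16, num_steps=4):
    # Sequential recurrent-depth decoding would spend num_steps forward passes
    # per token. Here each pass (i) refines every in-progress position by one
    # recurrence step and (ii) drafts a token at a newly opened position, so
    # roughly one token is emitted per pass once the pipeline is full.
    out = prompt_ids.tolist()
    prev_token = prompt_ids[-1:]        # last committed token
    window = []                         # in-progress positions, oldest first

    while len(out) < len(prompt_ids) + max_new_tokens:
        emb = model.embed(prev_token)
        window.append([torch.zeros_like(emb), num_steps, None])  # open a slot

        # One parallel sweep: each slot takes one recurrence step, conditioned
        # on the current token guess of the position before it.
        for slot in window:
            slot[0] = model.core_step(slot[0], emb)       # refine latent state
            slot[1] -= 1                                  # remaining steps
            slot[2] = model.readout(slot[0]).argmax(-1)   # current token guess
            emb = model.embed(slot[2])

        # Commit positions whose refinement budget is exhausted.
        while window and window[0][1] == 0:
            token = window.pop(0)[2]
            out.append(token.item())
            prev_token = token

    return out


if __name__ == "__main__":
    torch.manual_seed(0)
    model = ToyRecurrentDepthLM()
    prompt = torch.randint(0, 256, (4,))
    print(diffusion_forcing_sample(model, prompt))

The fixed per-position budget num_steps stands in for whatever stopping rule the actual sampler uses to decide when a latent state is sufficiently refined; the key point illustrated is that refinement across positions proceeds in parallel rather than token by token.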