Efficient Parallel Samplers for Recurrent-Depth Models and Their Connection to Diffusion Language Models
October 16, 2025
Authors: Jonas Geiping, Xinyu Yang, Guinan Su
cs.AI
Abstract
Language models with recurrent depth, also referred to as universal or looped
when considering transformers, are defined by the capacity to increase their
computation through the repetition of layers. Recent efforts in pretraining
have demonstrated that these architectures can scale to modern language
modeling tasks while exhibiting advantages in reasoning tasks. In this work, we
examine the relationship between recurrent-depth models and diffusion language
models. Building on their similarities, we develop a new diffusion forcing
sampler for these models to accelerate generation. The sampler advances by
decoding new tokens at every forward pass of the model, while the latent states
of these tokens can be further refined in parallel through recurrence.
Theoretically, generation with our sampler is strictly more expressive than the
baseline autoregressive generation using the same time budget on modern
hardware. Moreover, this sampler, based on principles from the diffusion
literature, can be applied directly to existing 3.5B-parameter recurrent-depth
transformers without any tuning, yielding up to a 5x speedup. Consequently,
our findings not only provide an efficient mechanism for parallelizing the
extra computation in recurrent-depth models at inference, but also suggest that
such models can be naturally viewed as strong continuous, though causal,
diffusion language models.
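
The scheduling idea described above (a new token enters at every forward pass while earlier tokens keep being refined in parallel) can be illustrated with a small counting sketch. This is a toy under stated assumptions, not the sampler from the paper: the function names `sequential_passes` and `wavefront_schedule` are hypothetical, every position is assumed to need the same fixed number of recurrence steps, and no model is actually evaluated. It only compares how many sequential passes each schedule needs.

```python
from typing import Dict, List


def sequential_passes(num_tokens: int, recurrences: int) -> int:
    """Baseline: each token finishes all recurrence steps before the next starts."""
    return num_tokens * recurrences


def wavefront_schedule(num_tokens: int, recurrences: int) -> List[List[int]]:
    """Return, for each parallel pass, the positions refined in that pass.

    Assumed toy schedule: every pass admits one new position (if any remain)
    and applies one recurrence step to every position still in flight.
    A position is committed once it has received `recurrences` steps.
    """
    steps_done: Dict[int, int] = {}  # position -> recurrence steps applied so far
    next_position = 0
    schedule: List[List[int]] = []
    while next_position < num_tokens or steps_done:
        if next_position < num_tokens:
            steps_done[next_position] = 0  # admit a freshly drafted position
            next_position += 1
        active = sorted(steps_done)  # snapshot, so deletion below is safe
        for pos in active:
            steps_done[pos] += 1
            if steps_done[pos] == recurrences:
                del steps_done[pos]  # fully refined: commit and drop
        schedule.append(active)
    return schedule


if __name__ == "__main__":
    n, r = 4, 3
    for i, positions in enumerate(wavefront_schedule(n, r), start=1):
        print(f"pass {i}: refine positions {positions}")
    print("sequential passes:", sequential_passes(n, r))
    print("wavefront passes:", len(wavefront_schedule(n, r)))
```

Under these assumptions the wavefront needs `num_tokens + recurrences - 1` passes instead of `num_tokens * recurrences`, provided the hardware can process all in-flight positions in one batched pass. The speedup reported in the abstract is an empirical result for the real sampler and is not derived from this toy count.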