Test-Time Scaling in Diffusion LLMs via Hidden Semi-Autoregressive Experts
October 6, 2025
Authors: Jihoon Lee, Hoyeon Moon, Kevin Zhai, Arun Kumar Chithanar, Anit Kumar Sahu, Soummya Kar, Chul Lee, Souradip Chakraborty, Amrit Singh Bedi
cs.AI
Abstract
Diffusion-based large language models (dLLMs) are trained flexibly to model
extreme dependence in the data distribution; however, how to best utilize this
information at inference time remains an open problem. In this work, we uncover
an interesting property of these models: dLLMs trained on textual data
implicitly learn a mixture of semi-autoregressive experts, where different
generation orders reveal different specialized behaviors. We show that
committing to any single, fixed inference-time schedule, a common practice,
collapses performance by failing to leverage this latent ensemble. To address
this, we introduce HEX (Hidden semiautoregressive EXperts for test-time
scaling), a training-free inference method that ensembles across heterogeneous
block schedules. By taking a majority vote over generation paths produced with
diverse block sizes, HEX robustly avoids the failure modes associated with any single fixed
schedule. On reasoning benchmarks such as GSM8K, it boosts accuracy by up to
3.56X (from 24.72% to 88.10%), outperforming top-K margin inference and
specialized fine-tuned methods like GRPO, without additional training. HEX also
yields significant gains on the MATH benchmark (from 16.40% to 40.00%), on scientific
reasoning with ARC-C (from 54.18% to 87.80%), and on TruthfulQA (from 28.36% to 57.46%).
Our results establish a new paradigm for test-time scaling in diffusion-based
LLMs (dLLMs), revealing that the sequence in which masking is performed plays a
critical role in determining performance during inference.
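
To make the abstract's mechanism concrete, here is a minimal sketch of voting over heterogeneous block schedules. The helpers `generate_fn`, `extract_answer_fn`, and the specific block sizes are illustrative assumptions, not the paper's implementation: the idea is simply to decode the same prompt under several semi-autoregressive block sizes (each inducing a different generation order, i.e. a different "hidden" expert) and take a majority vote over the parsed answers.

```python
from collections import Counter

def hex_majority_vote(prompt, generate_fn, extract_answer_fn,
                      block_sizes=(4, 8, 16, 32, 64)):
    """Ensemble a dLLM over heterogeneous semi-autoregressive block schedules.

    generate_fn(prompt, block_size=...) and extract_answer_fn(text) are
    hypothetical stand-ins for the model's block-wise decoding routine and an
    answer parser (e.g. the final number of a GSM8K solution).
    """
    votes = Counter()
    completions = {}
    for block_size in block_sizes:
        # Each block size induces a different unmasking order, so each pass
        # queries a different implicit semi-autoregressive expert.
        text = generate_fn(prompt, block_size=block_size)
        answer = extract_answer_fn(text)
        if answer is not None:
            votes[answer] += 1
            completions.setdefault(answer, text)

    if not votes:
        return None, None

    # Plain majority vote over the extracted answers.
    best_answer, _ = votes.most_common(1)[0]
    return best_answer, completions[best_answer]
```

Under these assumptions, ties could also be broken by model confidence or sequence likelihood; the abstract specifies only a plain majority vote, which is what the sketch implements.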