Test-Time Scaling in Diffusion LLMs via Hidden Semi-Autoregressive Experts
October 6, 2025
Authors: Jihoon Lee, Hoyeon Moon, Kevin Zhai, Arun Kumar Chithanar, Anit Kumar Sahu, Soummya Kar, Chul Lee, Souradip Chakraborty, Amrit Singh Bedi
cs.AI
Abstract
Diffusion-based large language models (dLLMs) are trained flexibly to model extreme dependence in the data distribution; however, how to best utilize this information at inference time remains an open problem. In this work, we uncover an interesting property of these models: dLLMs trained on textual data implicitly learn a mixture of semi-autoregressive experts, where different generation orders reveal different specialized behaviors. We show that committing to any single, fixed inference-time schedule, a common practice, collapses performance by failing to leverage this latent ensemble. To address this, we introduce HEX (Hidden semi-autoregressive EXperts for test-time scaling), a training-free inference method that ensembles across heterogeneous block schedules. By taking a majority vote over generation paths with diverse block sizes, HEX robustly avoids the failure modes associated with any single fixed schedule. On reasoning benchmarks such as GSM8K, it boosts accuracy by up to 3.56X (from 24.72% to 88.10%), outperforming top-K margin inference and specialized fine-tuned methods such as GRPO, without additional training. HEX also yields significant gains on the MATH benchmark (from 16.40% to 40.00%), on scientific reasoning with ARC-C (from 54.18% to 87.80%), and on TruthfulQA (from 28.36% to 57.46%). Our results establish a new paradigm for test-time scaling in diffusion-based LLMs, revealing that the order in which masked positions are decoded plays a critical role in inference performance.
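
To make the ensembling idea concrete, here is a minimal sketch of a majority vote over heterogeneous block schedules. This is not the authors' released implementation: `generate_with_block_size`, the choice of block sizes, and the placeholder answer extraction are all hypothetical stand-ins for a real dLLM block-wise (semi-autoregressive) sampler.

```python
from collections import Counter

def generate_with_block_size(prompt: str, block_size: int) -> str:
    """Hypothetical stand-in for a dLLM sampler that unmasks tokens
    block by block with the given block size. In practice this would
    call your diffusion LLM and return its final extracted answer."""
    return f"answer-from-block-{block_size}"  # placeholder output

def majority_vote_over_schedules(prompt: str, block_sizes=(4, 8, 16, 32)) -> str:
    """Run one generation per block schedule and return the most
    common final answer (HEX-style training-free ensembling)."""
    answers = [generate_with_block_size(prompt, b) for b in block_sizes]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

if __name__ == "__main__":
    print(majority_vote_over_schedules("Q: 7 * 8 = ?"))
```

The key design point illustrated here is that no extra training or reward model is required: the ensemble members are simply different block schedules of the same pretrained dLLM, aggregated by a vote over their final answers.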