Test-Time Scaling in Diffusion LLMs via Hidden Semi-Autoregressive Experts
October 6, 2025
Authors: Jihoon Lee, Hoyeon Moon, Kevin Zhai, Arun Kumar Chithanar, Anit Kumar Sahu, Soummya Kar, Chul Lee, Souradip Chakraborty, Amrit Singh Bedi
cs.AI
Abstract
Diffusion-based large language models (dLLMs) are trained flexibly to model extreme dependence in the data distribution; however, how to best utilize this information at inference time remains an open problem. In this work, we uncover an interesting property of these models: dLLMs trained on textual data implicitly learn a mixture of semi-autoregressive experts, where different generation orders reveal different specialized behaviors. We show that committing to any single, fixed inference-time schedule, a common practice, collapses performance by failing to leverage this latent ensemble. To address this, we introduce HEX (Hidden semi-autoregressive EXperts for test-time scaling), a training-free inference method that ensembles across heterogeneous block schedules. By taking a majority vote over generation paths with diverse block sizes, HEX robustly avoids the failure modes associated with any single fixed schedule. On reasoning benchmarks such as GSM8K, it boosts accuracy by up to 3.56X (from 24.72% to 88.10%), outperforming top-K margin inference and specialized fine-tuned methods such as GRPO, without additional training. HEX also yields significant gains on the MATH benchmark (from 16.40% to 40.00%), on scientific reasoning with ARC-C (from 54.18% to 87.80%), and on TruthfulQA (from 28.36% to 57.46%). Our results establish a new paradigm for test-time scaling in diffusion-based LLMs, revealing that the order in which masked positions are decoded plays a critical role in inference performance.
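
To make the ensembling idea concrete, here is a minimal sketch of a majority vote over heterogeneous block schedules. This is not the authors' released implementation: `generate_with_block_size`, the choice of block sizes, and the placeholder answer extraction are all hypothetical stand-ins for a real dLLM block-wise (semi-autoregressive) sampler.

```python
from collections import Counter

def generate_with_block_size(prompt: str, block_size: int) -> str:
    """Hypothetical stand-in for a dLLM sampler that unmasks tokens
    block by block with the given block size. In practice this would
    call your diffusion LLM and return its final extracted answer."""
    return f"answer-from-block-{block_size}"  # placeholder output

def majority_vote_over_schedules(prompt: str, block_sizes=(4, 8, 16, 32)) -> str:
    """Run one generation per block schedule and return the most
    common final answer (HEX-style training-free ensembling)."""
    answers = [generate_with_block_size(prompt, b) for b in block_sizes]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

if __name__ == "__main__":
    print(majority_vote_over_schedules("Q: 7 * 8 = ?"))
```

The key design point illustrated here is that no extra training or reward model is required: the ensemble members are simply different block schedules of the same pretrained dLLM, aggregated by a vote over their final answers.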