
Test-Time Scaling in Diffusion LLMs via Hidden Semi-Autoregressive Experts

October 6, 2025
Authors: Jihoon Lee, Hoyeon Moon, Kevin Zhai, Arun Kumar Chithanar, Anit Kumar Sahu, Soummya Kar, Chul Lee, Souradip Chakraborty, Amrit Singh Bedi
cs.AI

Abstract

Diffusion-based large language models (dLLMs) are trained flexibly to model extreme dependence in the data distribution; however, how to best utilize this information at inference time remains an open problem. In this work, we uncover an interesting property of these models: dLLMs trained on textual data implicitly learn a mixture of semi-autoregressive experts, where different generation orders reveal different specialized behaviors. We show that the common practice of committing to any single, fixed inference-time schedule collapses performance by failing to leverage this latent ensemble. To address this, we introduce HEX (Hidden semi-autoregressive EXperts for test-time scaling), a training-free inference method that ensembles across heterogeneous block schedules. By taking a majority vote over generation paths with diverse block sizes, HEX robustly avoids the failure modes associated with any single fixed schedule. On reasoning benchmarks such as GSM8K, it boosts accuracy by up to 3.56X (from 24.72% to 88.10%), outperforming top-K margin inference and specialized fine-tuned methods like GRPO, without additional training. HEX also yields significant gains on the MATH benchmark, from 16.40% to 40.00%, on scientific reasoning with ARC-C, from 54.18% to 87.80%, and on TruthfulQA, from 28.36% to 57.46%. Our results establish a new paradigm for test-time scaling in diffusion-based LLMs (dLLMs), revealing that the order in which masked tokens are decoded plays a critical role in determining performance during inference.
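To make the ensembling step described above concrete, here is a minimal sketch of majority voting over completions produced under different block schedules. The `generate` and `extract_answer` callables, and the block sizes shown, are illustrative placeholders, not the paper's implementation.

```python
from collections import Counter
from typing import Callable, Iterable

def hex_inference(
    prompt: str,
    generate: Callable[[str, int], str],   # dLLM decoder: (prompt, block_size) -> completion (assumed interface)
    extract_answer: Callable[[str], str],  # task-specific parser: completion -> final answer (assumed interface)
    block_sizes: Iterable[int] = (4, 8, 16, 32, 64),  # illustrative block schedules, not the paper's exact set
) -> str:
    """Run one generation per block schedule, then majority-vote the parsed answers."""
    # Each block size induces a different semi-autoregressive generation order,
    # i.e., a different "hidden expert" in the trained dLLM.
    answers = [extract_answer(generate(prompt, b)) for b in block_sizes]
    # Majority vote across the heterogeneous schedules selects the consensus answer.
    return Counter(answers).most_common(1)[0][0]
```

Because the method only re-runs inference under different schedules and aggregates the outputs, no additional training or model modification is involved.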