Test-Time Scaling in Diffusion LLMs via Hidden Semi-Autoregressive Experts

Abstract

Diffusion-based large language models (dLLMs) are trained to flexibly model extreme dependencies in the data distribution; however, how best to use this information at inference time remains an open problem. In this work, we uncover an interesting property of these models: dLLMs trained on textual data implicitly learn a mixture of semi-autoregressive experts, where different generation orders reveal different specialized behaviors. We show that committing to any single, fixed inference-time schedule, a common practice, collapses performance by failing to leverage this latent ensemble. To address this, we introduce HEX (Hidden semiautoregressive EXperts for test-time scaling), a training-free inference method that ensembles across heterogeneous block schedules. By taking a majority vote over generation paths with diverse block sizes, HEX robustly avoids the failure modes associated with any single fixed schedule. On reasoning benchmarks such as GSM8K, it boosts accuracy by up to 3.56× (from 24.72% to 88.10%), outperforming top-K margin inference and specialized fine-tuned methods such as GRPO, without additional training. HEX also yields significant gains on the MATH benchmark (from 16.40% to 40.00%), on scientific reasoning with ARC-C (from 54.18% to 87.80%), and on TruthfulQA (from 28.36% to 57.46%). Our results establish a new paradigm for test-time scaling in dLLMs, revealing that the sequence in which masking is performed plays a critical role in determining performance during inference.
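The voting step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `hex_majority_vote`, the `generate_answer` callable, the block sizes, and the tie-breaking rule are all assumptions made for illustration, and `toy_generate` is a canned stand-in for a real dLLM decoder run under a semi-autoregressive schedule with a given block size.

```python
from collections import Counter
from typing import Callable, Hashable, Iterable, Optional


def hex_majority_vote(
    prompt: str,
    generate_answer: Callable[[str, int], Optional[Hashable]],
    block_sizes: Iterable[int],
) -> Optional[Hashable]:
    """Run one generation per block size and return the most common answer.

    `generate_answer` is a hypothetical callable: it should decode `prompt`
    with the given block size and return the parsed final answer, or None
    if no answer could be extracted.
    """
    answers = []
    for block_size in block_sizes:
        answer = generate_answer(prompt, block_size)
        if answer is not None:  # ignore runs with no parseable answer
            answers.append(answer)
    if not answers:
        return None
    # Assumption: ties go to the answer that appeared first, which is how
    # Counter.most_common orders elements with equal counts.
    return Counter(answers).most_common(1)[0][0]


def toy_generate(prompt: str, block_size: int) -> Optional[str]:
    """Canned stand-in for a dLLM decoder (illustrative values only)."""
    canned = {8: "41", 16: "42", 32: "42", 64: None, 128: "42"}
    return canned[block_size]


result = hex_majority_vote("What is 6 * 7?", toy_generate, [8, 16, 32, 64, 128])
print(result)
```

In this toy run, one schedule returns a wrong answer and one returns nothing, yet the vote still recovers the answer shared by the remaining schedules. That is the intuition behind ensembling over heterogeneous block schedules rather than committing to one.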
