Test-Time Scaling in Diffusion LLMs via Hidden Semi-Autoregressive Experts

Abstract

Diffusion-based large language models (dLLMs) are trained to flexibly model extreme dependencies in the data distribution; however, how best to use this information at inference time remains an open problem. In this work, we uncover an interesting property of these models: dLLMs trained on textual data implicitly learn a mixture of semi-autoregressive experts, where different generation orders reveal different specialized behaviors. We show that committing to any single, fixed inference-time schedule, a common practice, collapses performance by failing to leverage this latent ensemble. To address this, we introduce HEX (Hidden semiautoregressive EXperts for test-time scaling), a training-free inference method that ensembles across heterogeneous block schedules. By taking a majority vote over generation paths with diverse block sizes, HEX robustly avoids the failure modes associated with any single fixed schedule. On reasoning benchmarks such as GSM8K, it boosts accuracy by up to 3.56× (from 24.72% to 88.10%), outperforming top-K margin inference and specialized fine-tuned methods like GRPO, without additional training. HEX also yields significant gains on the MATH benchmark (from 16.40% to 40.00%), on scientific reasoning with ARC-C (from 54.18% to 87.80%), and on TruthfulQA (from 28.36% to 57.46%). Our results establish a new paradigm for test-time scaling in dLLMs, revealing that the order in which masking is performed plays a critical role in determining performance during inference.
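The voting step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `generate` callback signature, the block sizes, and the tie-breaking rule are all assumptions made for illustration, and the toy generator merely stands in for a real semi-autoregressive decoding pass of a dLLM.

```python
from collections import Counter
from typing import Callable, Hashable, Iterable, Optional


def hex_majority_vote(
    generate: Callable[[str, int], Optional[Hashable]],
    prompt: str,
    block_sizes: Iterable[int],
) -> Optional[Hashable]:
    """Run one decoding pass per block size and return the most common answer.

    `generate` is a hypothetical callback: it decodes `prompt` with the given
    block size and returns the extracted final answer, or None if no answer
    could be parsed from that pass.
    """
    answers = []
    for block_size in block_sizes:
        answer = generate(prompt, block_size)
        if answer is not None:  # ignore passes without a usable answer
            answers.append(answer)
    if not answers:
        return None
    # Assumption: ties go to the answer seen first, because
    # Counter.most_common keeps first-encountered order for equal counts.
    return Counter(answers).most_common(1)[0][0]


def toy_generate(prompt: str, block_size: int) -> Optional[str]:
    """Hypothetical stand-in for a dLLM: each block size gives a fixed answer."""
    fake_answers = {8: "41", 16: "42", 32: "42", 64: "17", 128: "42"}
    return fake_answers.get(block_size)


if __name__ == "__main__":
    print(hex_majority_vote(toy_generate, "toy question", [8, 16, 32, 64, 128]))
```

In this toy run, three of the five block schedules agree, so their answer wins even though two individual schedules fail. That is the intuition behind ensembling over schedules rather than committing to one.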
