When Confidence Misleads: Suffix Anchoring and Anchor-Proximity Confidence Modulation for Diffusion Language Models

Abstract

Diffusion language models decode text by iteratively denoising masked token sequences, so choosing which positions to decode is a central inference-time decision. Most training-free decoding strategies use model confidence to select positions, assuming that high-confidence positions are ready to be decoded. In this work, we revisit this assumption by studying when confidence misleads fully non-autoregressive (fully non-AR) decoding. EOT tokens can receive high confidence and cause incomplete generation. Inserting a suffix anchor can mitigate this issue, but it introduces local overconfidence near the anchor, causing anchor-adjacent tokens to be decoded too early. To address these issues, we propose Suffix-Anchored Confidence Modulation, a simple training-free method that inserts a short suffix anchor to encourage response completion and modulates confidence near the anchor according to decoding progress. This preserves the response-completion benefit of suffix anchoring while reducing premature decoding of anchor-adjacent tokens. Across text-only reasoning, vision-language reasoning, and code-generation benchmarks, our method consistently improves confidence-based fully non-AR decoding, outperforms explicit EOT suppression, and preserves the parallel decoding advantage of fully non-AR generation.
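To make the mechanism concrete, the following is a minimal illustrative sketch, not the authors' implementation. It lays out a decoding canvas as prompt, masked response slots, and a fixed suffix anchor. It then scales down confidence at masked positions just before the anchor before the top-k confidence selection step. The mask id, the linear `window`/`strength` schedule, and the definition of `progress` as the fraction of response positions already decoded are all hypothetical assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple

MASK_ID = -1  # hypothetical id for a still-masked position


def build_canvas(prompt_ids: Sequence[int], gen_len: int,
                 anchor_ids: Sequence[int]) -> Tuple[List[int], int]:
    """Lay out prompt, gen_len masked slots, then a fixed suffix anchor."""
    canvas = list(prompt_ids) + [MASK_ID] * gen_len + list(anchor_ids)
    anchor_start = len(prompt_ids) + gen_len  # index of first anchor token
    return canvas, anchor_start


def modulated_confidence(conf: Sequence[float], anchor_start: int,
                         progress: float, window: int = 4,
                         strength: float = 0.5) -> List[float]:
    """Down-weight confidence for the `window` positions before the anchor.

    Hypothetical schedule: the penalty grows linearly as a position gets
    closer to the anchor and fades to zero as progress (in [0, 1]) -> 1.
    """
    out = list(conf)
    for i in range(len(conf)):
        dist = anchor_start - i  # 1 means immediately before the anchor
        if 1 <= dist <= window:
            closeness = (window - dist + 1) / window
            out[i] = conf[i] * (1.0 - strength * closeness * (1.0 - progress))
    return out


def select_positions(conf: Sequence[float], masked: Sequence[bool],
                     k: int) -> List[int]:
    """Return the k masked positions with the highest confidence, in order."""
    candidates = [i for i in range(len(conf)) if masked[i]]
    candidates.sort(key=lambda i: conf[i], reverse=True)
    return sorted(candidates[:k])


if __name__ == "__main__":
    canvas, anchor_start = build_canvas([101, 102], 8, [7, 8])
    masked = [tok == MASK_ID for tok in canvas]
    # Toy confidences: overconfident slots right before the anchor.
    conf = [0.0, 0.0, 0.5, 0.4, 0.3, 0.3, 0.6, 0.7, 0.8, 0.9, 0.0, 0.0]
    print(select_positions(conf, masked, 2))  # [8, 9]
    early = modulated_confidence(conf, anchor_start, progress=0.0)
    print(select_positions(early, masked, 2))  # [6, 7]
```

With raw confidence, the two anchor-adjacent slots (8 and 9) are chosen first. Early in decoding, the modulated scores defer them in favor of slots further from the anchor. At `progress=1.0` the scaling factor becomes 1 and selection falls back to plain confidence ranking.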