Decoding as Optimisation on the Probability Simplex: From Top-K to Top-P (Nucleus) to Best-of-K Samplers

Abstract

Decoding sits between a language model and everything we do with it, yet it is still treated as a heuristic knob-tuning exercise. We argue that decoding should be understood as a principled optimisation layer: at each token, we solve a regularised problem over the probability simplex that trades off model score against structural preferences and constraints. This single template recovers greedy decoding, Softmax sampling, Top-K, Top-P, and Sparsemax-style sparsity as special cases, and explains their common structure through optimality conditions. More importantly, the framework makes it easy to design new decoders without relying on folklore. We demonstrate this by designing Best-of-K (BoK), a KL-anchored coverage objective aimed at multi-sample pipelines (self-consistency, reranking, verifier selection). BoK targets the probability of covering good alternatives within a fixed K-sample budget and improves empirical performance. For example, we show that such samples can improve the accuracy of Qwen2.5-Math-7B on MATH500 by +18.6% at high sampling temperatures.
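The abstract does not spell out the template itself, so the following is only a minimal sketch under a stated assumption: that the per-token problem has the familiar form "maximise <p, s> minus a regulariser over the probability simplex", where s is the vector of model scores (logits). Under that assumption, a Shannon-entropy regulariser yields Softmax, a squared-L2 regulariser yields Sparsemax (the Euclidean projection onto the simplex), a support restriction yields Top-K, and a vanishing temperature approaches greedy decoding. The function names and the `tau` parameter are illustrative and are not taken from the paper; Top-P and BoK are deliberately left out because their exact objectives are not given here.

```python
import math


def softmax(scores, tau=1.0):
    """Maximiser of <p, s> + tau * H(p) over the simplex (H = Shannon entropy)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / tau) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def sparsemax(scores):
    """Maximiser of <p, s> - 0.5 * ||p||^2 over the simplex,
    i.e. the Euclidean projection of the scores onto the simplex."""
    ordered = sorted(scores, reverse=True)
    cumsum = 0.0
    threshold = 0.0
    for k, zk in enumerate(ordered, start=1):
        cumsum += zk
        # The support is the largest k for which this condition holds.
        if 1.0 + k * zk > cumsum:
            threshold = (cumsum - 1.0) / k
    return [max(s - threshold, 0.0) for s in scores]


def top_k(scores, k, tau=1.0):
    """Entropy-regularised problem with the support restricted
    to the k highest-scoring tokens (assumed formulation)."""
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    probs = softmax([scores[i] for i in keep], tau)
    out = [0.0] * len(scores)
    for i, p in zip(keep, probs):
        out[i] = p
    return out


def objective(p, scores, tau=1.0):
    """Value of <p, s> + tau * H(p) for a distribution p."""
    entropy = -sum(pi * math.log(pi) for pi in p if pi > 0.0)
    return sum(pi * si for pi, si in zip(p, scores)) + tau * entropy


scores = [1.0, 0.8, 0.1]           # hypothetical logits for a 3-token vocabulary
p_soft = softmax(scores)           # dense: every token keeps some mass
p_sparse = sparsemax(scores)       # sparse: the lowest-scoring token gets exactly zero
p_topk = top_k(scores, k=2)        # dense on the top 2 tokens, zero elsewhere
p_greedy = softmax(scores, 0.01)   # tau -> 0 concentrates mass on the argmax
```

In this sketch the decoders differ only in the regulariser or in the constraint set, which is the kind of shared structure the abstract refers to. The entropy-regularised objective evaluated at `p_soft` is at least as large as at any other distribution on the simplex, which can be checked numerically with `objective`.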
