Decoding as Optimisation on the Probability Simplex: From Top-K to Top-P (Nucleus) to Best-of-K Samplers

February 20, 2026
作者: Xiaotong Ji, Rasul Tutunov, Matthieu Zimmer, Haitham Bou-Ammar
cs.AI

Abstract
Decoding sits between a language model and everything we do with it, yet it is still treated as a heuristic knob-tuning exercise. We argue decoding should be understood as a principled optimisation layer: at each token, we solve a regularised problem over the probability simplex that trades off model score against structural preferences and constraints. This single template recovers greedy decoding, Softmax sampling, Top-K, Top-P, and Sparsemax-style sparsity as special cases, and explains their common structure through optimality conditions. More importantly, the framework makes it easy to invent new decoders without folklore. We demonstrate this by designing Best-of-K (BoK), a KL-anchored coverage objective aimed at multi-sample pipelines (self-consistency, reranking, verifier selection). BoK targets the probability of covering good alternatives within a fixed K-sample budget and improves empirical performance. We show that such samplers can improve accuracy by, for example, +18.6% for Qwen2.5-Math-7B on MATH500 at high sampling temperatures.
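As a concrete illustration of two of the special cases the abstract names, the following is a minimal sketch of Top-K and Top-P (nucleus) truncation over a toy next-token distribution. This is the standard formulation of these samplers, not code from the paper, and the function names and the toy probabilities are illustrative assumptions:

```python
def top_k_filter(probs, k):
    """Keep the k highest-probability tokens, zero the rest, renormalise.

    `probs` is a full next-token distribution (sums to 1).
    """
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    keep = set(order[:k])
    mass = sum(probs[i] for i in keep)
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p,
    zero the rest, renormalise (nucleus sampling's truncation step)."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]


# Toy 4-token distribution (hypothetical numbers for illustration).
probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_filter(probs, 2))    # only the top-2 tokens keep mass
print(top_p_filter(probs, 0.9))  # nucleus {0.5, 0.3, 0.15} covers >= 0.9
```

Both samplers produce a renormalised distribution supported on a truncated set, which is exactly the kind of sparse solution the paper's regularised simplex problem yields as a special case.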