Decoding as Optimisation on the Probability Simplex: From Top-K to Top-P (Nucleus) to Best-of-K Samplers

Abstract

Decoding sits between a language model and everything we do with it, yet it is still treated as a heuristic knob-tuning exercise. We argue that decoding should be understood as a principled optimisation layer: at each token-generation step, we solve a regularised problem over the probability simplex that trades off model score against structural preferences and constraints. This single template recovers greedy decoding, Softmax sampling, Top-K, Top-P, and Sparsemax-style sparse methods as special cases, and its optimality conditions expose the structure they share. More importantly, the framework makes it possible to design new decoders without relying on rules of thumb. We demonstrate this by designing Best-of-K (BoK), a KL-anchored coverage objective aimed at multi-sample pipelines (self-consistency, reranking, verifier selection). BoK targets the probability of covering good alternatives within a fixed budget of K samples and improves empirical performance. In our experiments, such samples improve accuracy by, for example, +18.6% for Qwen2.5-Math-7B on MATH500 at high sampling temperatures.
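To make the "regularised problem over the probability simplex" concrete, the following is a minimal NumPy sketch under stated assumptions; it is not the paper's implementation, and all function names are hypothetical. It relies on two standard facts: maximising <p, s> + tau * H(p) over the simplex (H the Shannon entropy) has the closed form softmax(s / tau), and replacing the entropy term with -0.5 * ||p||^2 yields Sparsemax, the Euclidean projection of s onto the simplex. Top-K and Top-P are written here as the entropy-regularised solution restricted to a reduced support, which is one common reading and may differ from the exact formulation in the paper. The BoK objective is not sketched, since the abstract does not specify it.

```python
import numpy as np


def entropy(p):
    """Shannon entropy with the convention 0 * log 0 = 0."""
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))


def regularised_objective(p, scores, tau):
    """Entropy-regularised objective <p, s> + tau * H(p) (assumed form)."""
    return float(p @ scores) + tau * entropy(p)


def softmax_decoder(scores, tau=1.0):
    """Closed-form maximiser of the objective above: softmax(s / tau)."""
    scores = np.asarray(scores, dtype=float)
    e = np.exp((scores - scores.max()) / tau)  # shift for numerical stability
    return e / e.sum()


def top_k_decoder(scores, k, tau=1.0):
    """Same objective, restricted to the k highest-scoring tokens."""
    scores = np.asarray(scores, dtype=float)
    keep = np.argsort(scores)[-k:]
    p = np.zeros_like(scores)
    p[keep] = softmax_decoder(scores[keep], tau)
    return p


def top_p_decoder(scores, top_p, tau=1.0):
    """Nucleus: keep the smallest high-probability set with mass >= top_p."""
    base = softmax_decoder(scores, tau)
    order = np.argsort(-base)
    cum = np.cumsum(base[order])
    cutoff = min(int(np.searchsorted(cum, top_p)) + 1, len(base))
    keep = order[:cutoff]
    p = np.zeros_like(base)
    p[keep] = base[keep] / base[keep].sum()
    return p


def sparsemax_decoder(scores):
    """Maximiser of <p, s> - 0.5 * ||p||^2: projection onto the simplex."""
    scores = np.asarray(scores, dtype=float)
    z = np.sort(scores)[::-1]
    cssv = np.cumsum(z)
    ks = np.arange(1, len(z) + 1)
    k = ks[1 + ks * z > cssv][-1]      # size of the support
    threshold = (cssv[k - 1] - 1) / k  # shift that makes the result sum to 1
    return np.maximum(scores - threshold, 0.0)


if __name__ == "__main__":
    s = np.array([2.0, 1.0, 0.1])
    print("softmax  ", softmax_decoder(s))
    print("greedy   ", softmax_decoder(s, tau=1e-3))  # tau -> 0 limit
    print("top-k    ", top_k_decoder(s, k=2))
    print("top-p    ", top_p_decoder(s, top_p=0.8))
    print("sparsemax", sparsemax_decoder(s))
```

In this reading, greedy decoding is the limit of small tau, where the entropy term vanishes and all mass moves to the highest-scoring token. Each decoder differs only in the regulariser or in the support constraint, which is the shared structure the abstract refers to.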