How Alignment Shrinks the Generative Horizon
June 22, 2025
Authors: Chenghao Yang, Ari Holtzman
cs.AI
Abstract
Despite their impressive capabilities, aligned large language models (LLMs)
often generate outputs that lack diversity. What drives this stability in
generation? We investigate this phenomenon through the lens of probability
concentration in the model's output distribution. To quantify this
concentration, we introduce the Branching Factor (BF) -- a token-invariant
measure of the effective number of plausible next steps during generation. Our
empirical analysis reveals two key findings: (1) BF often decreases as
generation progresses, suggesting that LLMs become more predictable as they
generate. (2) Alignment tuning substantially sharpens the model's output
distribution from the outset, reducing BF by nearly an order of magnitude
(e.g., from 12 to 1.2) relative to base models. This stark reduction helps
explain why aligned models often appear less sensitive to decoding strategies.
Building on this insight, we find this stability has surprising implications
for complex reasoning. Aligned Chain-of-Thought (CoT) models (e.g.,
DeepSeek-distilled models), for instance, leverage this effect; by generating
longer reasoning chains, they push generation into later, more deterministic
(lower BF) stages, resulting in more stable outputs. We hypothesize that
alignment tuning does not fundamentally change a model's behavior, but instead
steers it toward stylistic tokens (e.g., "Sure") that unlock low-entropy
trajectories already present in the base model. This view is supported by
nudging experiments, which show that prompting base models with such tokens can
similarly reduce BF. Together, our findings establish BF as a powerful
diagnostic for understanding and controlling LLM outputs: it clarifies how
alignment reduces variability, how CoT promotes stable generations, and how
base models can be steered away from diversity.
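To make the "effective number of plausible next steps" concrete, here is a minimal sketch of one common way such a quantity can be operationalized: the exponential of the next-token entropy, a perplexity-style count of effective choices. The abstract does not give the paper's exact estimator, so the per-step, logits-based formulation and the function name `branching_factor` below are illustrative assumptions rather than the authors' definition.

```python
import torch
import torch.nn.functional as F

def branching_factor(logits: torch.Tensor) -> torch.Tensor:
    """Effective number of plausible next tokens at each generation step.

    Computed here as exp(entropy) of the next-token distribution
    (an assumption; the paper may define BF differently).

    logits: (seq_len, vocab_size) tensor of next-token logits.
    Returns a (seq_len,) tensor of per-step branching factors.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)  # entropy in nats per step
    return entropy.exp()  # exp(H): 1 = fully deterministic, |V| = uniform

# Toy example: a sharply peaked distribution vs. a flat one over a 1000-token vocab.
vocab = 1000
peaked = torch.full((1, vocab), -10.0)
peaked[0, 0] = 10.0              # nearly all mass on one token -> BF close to 1
flat = torch.zeros((1, vocab))   # uniform over the vocabulary -> BF close to 1000

print(branching_factor(peaked))  # ~1.0
print(branching_factor(flat))    # ~1000.0
```

Under this reading, the abstract's "from 12 to 1.2" contrast says an aligned model behaves as if it has roughly one effective continuation per step where the base model has about twelve, and the nudging experiments would correspond to measuring this quantity on a base model with and without a prepended stylistic token such as "Sure".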