How Alignment Shrinks the Generative Horizon
June 22, 2025
Authors: Chenghao Yang, Ari Holtzman
cs.AI
Abstract
Despite their impressive capabilities, aligned large language models (LLMs) often generate outputs that lack diversity. What drives this stability in generation? We investigate this phenomenon through the lens of probability concentration in the model's output distribution. To quantify this concentration, we introduce the Branching Factor (BF) -- a token-invariant measure of the effective number of plausible next steps during generation. Our empirical analysis reveals two key findings: (1) BF often decreases as generation progresses, suggesting that LLMs become more predictable as they generate; (2) alignment tuning substantially sharpens the model's output distribution from the outset, reducing BF by nearly an order of magnitude (e.g., from 12 to 1.2) relative to base models. This stark reduction helps explain why aligned models often appear less sensitive to decoding strategies. Building on this insight, we find that this stability has surprising implications for complex reasoning. Aligned Chain-of-Thought (CoT) models (e.g., DeepSeek-distilled models), for instance, leverage this effect: by generating longer reasoning chains, they push generation into later, more deterministic (lower-BF) stages, resulting in more stable outputs. We hypothesize that alignment tuning does not fundamentally change a model's behavior, but instead steers it toward stylistic tokens (e.g., "Sure") that unlock low-entropy trajectories already present in the base model. This view is supported by nudging experiments, which show that prompting base models with such tokens can similarly reduce BF. Together, our findings establish BF as a powerful diagnostic for understanding and controlling LLM outputs -- clarifying how alignment reduces variability, how CoT promotes stable generations, and how base models can be steered away from diversity.
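
To make the idea concrete, the sketch below approximates a per-step branching factor as exp(entropy) of the model's next-token distribution, i.e., the effective number of plausible next tokens at each step. This is an illustrative proxy under stated assumptions, not the paper's exact measurement pipeline; the model name ("gpt2") and the "Sure," nudge prefix are placeholder choices used only to show how nudging a base model toward assistant-style tokens could be probed.

```python
# Minimal sketch (assumptions noted above): estimate a per-step "branching factor"
# as exp(entropy) of the next-token distribution during greedy decoding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in base LM; swap in any causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def stepwise_branching_factor(prompt: str, max_new_tokens: int = 30):
    """Greedy-decode and record exp(entropy) of each next-token distribution."""
    ids = tok(prompt, return_tensors="pt").input_ids
    bfs = []
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(ids).logits[0, -1]                 # next-token logits
            probs = torch.softmax(logits, dim=-1)
            entropy = -(probs * torch.log(probs + 1e-12)).sum()
            bfs.append(torch.exp(entropy).item())             # effective number of choices
            next_id = probs.argmax().reshape(1, 1)            # greedy next token
            ids = torch.cat([ids, next_id], dim=-1)
    return bfs

prompt = "Explain why the sky is blue."
print("plain prompt BF (first steps):", stepwise_branching_factor(prompt)[:5])
# Nudging the base model with an assistant-style token, as described in the abstract,
# should lower the measured BF if it steers generation onto a low-entropy trajectory.
print("nudged prompt BF (first steps):", stepwise_branching_factor(prompt + "\nSure, ")[:5])
```

Comparing the two printed traces gives a quick, informal check of the abstract's claims that BF tends to fall as generation progresses and that assistant-style prefixes can push a base model toward lower-BF trajectories.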