Where Does Output Diversity Collapse in Post-Training?

Abstract

Post-trained language models produce less varied outputs than their base counterparts. This output diversity collapse undermines inference-time scaling methods that rely on varied samples, and risks homogenizing model outputs on creative and value-laden tasks. Prior work attributes collapse to specific post-training methods, without separating the role of training data composition from that of the method, or the role of the generation format from that of the model weights. We trace output diversity across 15 tasks and four text diversity metrics through three parallel post-training lineages of Olmo 3: Think (chain-of-thought distillation), Instruct (broad multi-source data), and RL-Zero. We find that the location of collapse co-varies with data composition: the Think lineage loses most semantic diversity at supervised fine-tuning, and the effect of DPO is larger in Instruct than in Think. Suppressing chain-of-thought reasoning at inference in Think models drops accuracy on hard tasks, yet leaves answer-level diversity unchanged, showing that the collapse is embedded in the model weights by training data, not imposed by the generation format. Decomposing diversity loss on six verifiable tasks into a quality-control component (removal of incorrect outputs) and a residual component (genuine narrowing among correct outputs) reveals that the split is task-dependent, and Think models retain more correct-answer diversity than Instruct despite collapsing more in aggregate. Our results indicate that diversity collapse is determined during training by data composition and cannot be addressed at inference time alone.
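To make the quality-control versus residual split concrete, the following is a minimal sketch of one way such a decomposition could be computed. It is not the procedure used in this work: the diversity metric (mean pairwise Jaccard distance over whitespace tokens), the function names, and the exact definition of each component are all assumptions chosen for illustration.

```python
from itertools import combinations


def jaccard_distance(a: str, b: str) -> float:
    """1 - Jaccard similarity over lowercase whitespace tokens (assumed metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def mean_pairwise_distance(outputs: list[str]) -> float:
    """Average distance over all unordered pairs; 0.0 if fewer than two outputs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


def decompose_diversity_loss(
    base_outputs: list[str],
    base_correct: list[bool],
    post_outputs: list[str],
    post_correct: list[bool],
) -> dict[str, float]:
    """Hypothetical decomposition, assuming:
    quality_control = D(base, all) - D(base, correct only)
    residual        = D(base, correct only) - D(post, correct only)
    """
    base_ok = [o for o, ok in zip(base_outputs, base_correct) if ok]
    post_ok = [o for o, ok in zip(post_outputs, post_correct) if ok]

    d_base_all = mean_pairwise_distance(base_outputs)
    d_base_ok = mean_pairwise_distance(base_ok)
    d_post_ok = mean_pairwise_distance(post_ok)

    quality_control = d_base_all - d_base_ok  # loss explained by dropping wrong outputs
    residual = d_base_ok - d_post_ok          # narrowing among correct outputs
    return {
        "quality_control": quality_control,
        "residual": residual,
        "total": quality_control + residual,
    }
```

Under these assumed definitions, a task where the base model's variety comes mostly from wrong answers would show a large quality-control component, while a task where the post-trained model converges on a single phrasing of the right answer would show a large residual component.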