Where does output diversity collapse in post-training?
April 17, 2026
Authors: Constantinos Karouzos, Xingwei Tan, Nikolaos Aletras
cs.AI
Abstract
Post-trained language models produce less varied outputs than their base counterparts. This output diversity collapse undermines inference-time scaling methods that rely on varied samples, and risks homogenizing model outputs on creative and value-laden tasks. Prior work attributes collapse to specific post-training methods without separating the role of training data composition from that of the method, or the role of the generation format from that of the model weights. We trace output diversity through three parallel post-training lineages of Olmo 3: Think (chain-of-thought distillation), Instruct (broad multi-source data), and RL-Zero, across 15 tasks and four text diversity metrics. We find that the location of collapse co-varies with data composition: the Think lineage loses most semantic diversity at supervised fine-tuning, and the effect of DPO is larger in Instruct than in Think. Suppressing chain-of-thought reasoning at inference in Think models lowers accuracy on hard tasks yet leaves answer-level diversity unchanged, showing that the collapse is embedded in the model weights by training data, not imposed by the generation format. Decomposing diversity loss on six verifiable tasks into a quality-control component (removal of incorrect outputs) and a residual component (genuine narrowing among correct outputs) reveals that the split is task-dependent, and that Think models retain more correct-answer diversity than Instruct models despite collapsing more in aggregate. Our results indicate that diversity collapse is determined during training by data composition and cannot be addressed at inference time alone.
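The abstract does not state formulas for the quality-control and residual components, so the following is only a minimal sketch of one possible reading. It assumes the quality-control component is the diversity lost by filtering the base model's samples down to correct ones, and the residual is the further drop from the base model's correct samples to the post-trained model's correct samples. Mean pairwise Jaccard distance over word sets stands in for the paper's four diversity metrics, and the names `jaccard_distance`, `diversity`, and `decompose_loss` are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a: str, b: str) -> float:
    """Lexical distance between two outputs, computed on lowercase word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def diversity(outputs: list[str]) -> float:
    """Mean pairwise distance over a sample set (a stand-in metric, an assumption)."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


def decompose_loss(
    base: list[tuple[str, bool]], post: list[tuple[str, bool]]
) -> dict[str, float]:
    """Split diversity loss into two parts (assumed definitions, not the paper's).

    Each sample is (text, is_correct), with correctness given by a task verifier.
    """
    base_all = [text for text, _ in base]
    base_correct = [text for text, ok in base if ok]
    post_correct = [text for text, ok in post if ok]

    d_base_all = diversity(base_all)
    d_base_correct = diversity(base_correct)
    d_post_correct = diversity(post_correct)

    return {
        # Diversity removed simply by discarding incorrect base outputs.
        "quality_control": d_base_all - d_base_correct,
        # Narrowing that remains among correct outputs after post-training.
        "residual": d_base_correct - d_post_correct,
    }


if __name__ == "__main__":
    base_samples = [("a b", True), ("a c", True), ("x y", False)]
    post_samples = [("a b", True), ("a b", True)]
    print(decompose_loss(base_samples, post_samples))
```

Under these assumptions, a large quality-control share would mean most of the lost diversity came from wrong answers disappearing, while a large residual would indicate genuine narrowing among correct answers.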