Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment
January 20, 2026
Authors: Yuming Yang, Mingyoung Lai, Wanxu Zhao, Xiaoran Fan, Zhiheng Xi, Mingqi Wu, Chiyue Huang, Jun Zhao, Haijun Lv, Jian Tong, Yunhua Zhou, Yicheng Zou, Qipeng Guo, Tao Gui, Qi Zhang, Xuanjing Huang
cs.AI
Abstract
Long chain-of-thought (CoT) trajectories provide rich supervision signals for distilling reasoning from teacher to student LLMs. However, both prior work and our experiments show that trajectories from stronger teachers do not necessarily yield better students, highlighting the importance of data-student suitability in distillation. Existing methods assess suitability primarily through student likelihood, favoring trajectories that closely align with the model's current behavior but overlooking more informative ones. To address this, we propose Rank-Surprisal Ratio (RSR), a simple metric that captures both alignment and informativeness to assess the suitability of a reasoning trajectory. RSR is motivated by the observation that effective trajectories typically combine low absolute probability with relatively high-ranked tokens under the student model, balancing learning signal strength and behavioral alignment. Concretely, RSR is defined as the ratio of a trajectory's average token-wise rank to its average negative log-likelihood, and is straightforward to compute and interpret. Across five student models and reasoning trajectories from 11 diverse teachers, RSR strongly correlates with post-training performance (average Spearman 0.86), outperforming existing metrics. We further demonstrate its practical utility in both trajectory selection and teacher selection.
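To make the definition concrete, the following is a minimal sketch of how the ratio of average token-wise rank to average negative log-likelihood could be computed from a student model's next-token logits. The abstract does not spell out implementation details, so several choices here are assumptions: ranks are 1-indexed (the student's top-scoring token has rank 1), ties are broken in favor of the target token, surprisal is measured in nats, and no rank clipping or length normalization beyond the plain mean is applied. The function names `token_ranks_and_surprisals` and `rank_surprisal_ratio` are hypothetical and are not taken from the paper.

```python
import numpy as np


def token_ranks_and_surprisals(logits, token_ids):
    """Per-token rank and surprisal of a trajectory under the student.

    logits:    array of shape (T, V), the student's next-token logits at
               each of the T trajectory positions (assumed to be given).
    token_ids: array of shape (T,), the trajectory's actual token ids.
    """
    logits = np.asarray(logits, dtype=np.float64)
    token_ids = np.asarray(token_ids, dtype=np.int64)

    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    positions = np.arange(len(token_ids))
    target_logits = logits[positions, token_ids]

    # Surprisal = negative log-likelihood of the observed token (in nats).
    surprisals = -log_probs[positions, token_ids]

    # Assumed convention: rank 1 means no other token scores strictly higher.
    ranks = 1 + (logits > target_logits[:, None]).sum(axis=-1)
    return ranks, surprisals


def rank_surprisal_ratio(logits, token_ids):
    """Average token rank divided by average surprisal for one trajectory."""
    ranks, surprisals = token_ranks_and_surprisals(logits, token_ids)
    return float(ranks.mean() / surprisals.mean())


if __name__ == "__main__":
    # Toy example with random logits: 5 positions, vocabulary of 10 tokens.
    rng = np.random.default_rng(0)
    toy_logits = rng.normal(size=(5, 10))
    toy_tokens = rng.integers(0, 10, size=5)
    print(rank_surprisal_ratio(toy_logits, toy_tokens))
```

Under this reading, a trajectory whose tokens sit near the top of the student's ranking (small rank numbers) while still carrying high surprisal produces a small ratio, which appears to match the abstract's description of effective trajectories. In practice the logits would come from a forward pass of the student model over the teacher's trajectory, a step this sketch leaves out.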