Didactic to Constructive: Turning Expert Solutions into Learnable Reasoning
February 2, 2026
Authors: Ethan Mendes, Jungsoo Park, Alan Ritter
cs.AI
Abstract
Improving the reasoning capabilities of large language models (LLMs) typically relies either on the model's ability to sample a correct solution to be reinforced or on the existence of a stronger model able to solve the problem. However, many difficult problems remain intractable for even current frontier models, preventing the extraction of valid training signals. A promising alternative is to leverage high-quality expert human solutions, yet naive imitation of this data fails because it is fundamentally out of distribution: expert solutions are typically didactic, containing implicit reasoning gaps intended for human readers rather than computational models. Furthermore, high-quality expert solutions are expensive, necessitating generalizable sample-efficient training methods. We propose Distribution Aligned Imitation Learning (DAIL), a two-step method that bridges the distributional gap by first transforming expert solutions into detailed, in-distribution reasoning traces and then applying a contrastive objective to focus learning on expert insights and methodologies. We find that DAIL can leverage fewer than 1000 high-quality expert solutions to achieve 10-25% pass@k gains on Qwen2.5-Instruct and Qwen3 models, improve reasoning efficiency by 2x to 4x, and enable out-of-domain generalization.
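The abstract does not spell out the exact form of DAIL's contrastive objective. As a rough illustration only, the sketch below shows a generic DPO-style preference loss that pushes a policy toward an expert-derived reasoning trace and away from a self-generated failed trace; the function name, the `beta` temperature, and the log-probability-margin inputs are all hypothetical, not taken from the paper.

```python
import math

def contrastive_preference_loss(margin_expert, margin_model, beta=0.1):
    """Illustrative DPO-style contrastive loss (NOT the paper's exact objective).

    margin_expert: policy-minus-reference log-prob margin on the
        expert-derived, in-distribution reasoning trace.
    margin_model: the same margin on the model's own (e.g. failed) trace.
    beta: temperature scaling the preference gap (hypothetical default).

    Returns -log(sigmoid(beta * (margin_expert - margin_model))), which is
    small when the policy already prefers the expert-aligned trace and
    large when it prefers its own trace.
    """
    z = beta * (margin_expert - margin_model)
    return -math.log(1.0 / (1.0 + math.exp(-z)))
```

Minimizing this loss over pairs of (expert-aligned, self-generated) traces concentrates the learning signal on the expert's insights rather than on surface imitation, which is the intuition the abstract attributes to the contrastive step.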