Concise Reasoning, Big Gains: Pruning Long Reasoning Trace with Difficulty-Aware Prompting
May 26, 2025
Authors: Yifan Wu, Jingze Shi, Bingheng Wu, Jiayi Zhang, Xiaotian Lin, Nan Tang, Yuyu Luo
cs.AI
Abstract
Existing chain-of-thought (CoT) distillation methods can effectively transfer
reasoning abilities to base models but suffer from two major limitations:
excessive verbosity of reasoning traces and inadequate adaptability to problem
difficulty. Long reasoning traces significantly increase inference costs, and
uniform-length solutions prevent base models from learning adaptive reasoning
strategies. To address these issues, we propose a difficulty-aware prompting
(DAP) method to dynamically shorten reasoning traces without performance loss.
In our approach, a large teacher model first judges each problem's difficulty
and then rewrites its reasoning traces to an appropriate shorter length,
yielding concise yet complete reasoning traces. Leveraging the DAP pipeline, we
curate a distilled dataset called LiteCoT consisting of 100K concise reasoning
examples, with solutions averaging only 720 tokens (an order of magnitude
shorter than typical CoTs). Using LiteCoT, we distilled a new family of
reasoning models called Liter (1.5B, 7B, and 32B) based on the Qwen2.5
architecture. Experiments show that a student model fine-tuned on just 100K of
these difficulty-pruned CoT samples outperforms a model distilled on 800K
original long-CoT samples, while significantly reducing training and inference
costs. Our method also generalizes well: across 11 diverse benchmarks, the
shorter difficulty-aware CoTs achieve equal or better accuracy than long
chains, using far fewer tokens. For example, on the challenging AIME24 exam,
our approach reaches 74.2% Pass@1 using only about 5K inference tokens,
surpassing other methods that consume many more tokens. Our code and data are
available at https://github.com/Evanwu1125/LiteCoT.
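The two-stage pipeline described in the abstract (a teacher model first judges a problem's difficulty, then rewrites the long reasoning trace to a correspondingly shorter length) can be sketched as follows. This is a minimal illustration, not the authors' released code: `ask_teacher`, the exact prompts, and the difficulty-to-budget mapping are all hypothetical stand-ins (the 720-token figure is borrowed from the abstract's reported average).

```python
# Illustrative sketch of difficulty-aware prompting (DAP), assuming a
# generic teacher-LLM call. Not the authors' implementation.

# Hypothetical mapping from judged difficulty to a target CoT token
# budget; the "medium" value mirrors the 720-token average reported in
# the abstract, the others are made up for illustration.
DIFFICULTY_BUDGETS = {"easy": 256, "medium": 720, "hard": 2048}


def ask_teacher(prompt: str) -> str:
    """Hypothetical teacher-model call; swap in a real LLM client here."""
    raise NotImplementedError


def budget_for(difficulty: str) -> int:
    """Map a judged difficulty label to a target CoT token budget."""
    return DIFFICULTY_BUDGETS[difficulty]


def prune_trace(problem: str, long_cot: str, teacher=ask_teacher) -> str:
    # Stage 1: the teacher judges the problem's difficulty.
    difficulty = teacher(
        "Rate the difficulty of this problem as easy, medium, or hard:\n"
        + problem
    ).strip().lower()
    # Stage 2: the teacher rewrites the long trace within the budget
    # chosen for that difficulty, keeping the reasoning complete.
    budget = budget_for(difficulty)
    return teacher(
        f"Rewrite the following reasoning in at most {budget} tokens, "
        f"keeping every essential step:\n{long_cot}"
    )
```

Running `prune_trace` over a corpus of (problem, long CoT) pairs would yield difficulty-adapted short traces of the kind LiteCoT collects, which can then be used for ordinary supervised fine-tuning of a student model.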