Graph-Based Chain-of-Thought Pruning for Reducing Redundant Reflections in Reasoning LLMs
April 7, 2026
Authors: Hongyuan Yuan, Xinran He, Run Shao, Bolei He, Xianwei Xue, Mengke Chen, Qiutong Pan, Haiwei Wang, Haifeng Li
cs.AI
Abstract
Extending CoT through RL has been widely used to enhance the reasoning capabilities of LLMs. However, due to the sparsity of reward signals, it can also induce undesirable thinking patterns such as overthinking, i.e., generating redundant intermediate reasoning content. In this work, we argue that a major source of such redundancy is inefficient reflection, which often manifests in two problematic patterns: Indiscriminate Reflection, where the model performs broad, low-impact checks throughout reasoning, and Repetitive Reflection, where it repeatedly re-verifies an already established conclusion. To address this, we introduce a graph-based CoT optimization framework. Specifically, we convert each linear CoT into a directed acyclic graph (DAG) with explicit dependency edges, and design a dual pruning strategy: branch-level pruning removes weakly contributing reflection branches, while depth-level pruning eliminates late-stage re-verification. We distill this behavior via a three-stage pipeline: (1) SFT to initialize the policy on pruned concise traces, (2) DPO to prefer correct but less redundant trajectories, and (3) GRPO with length penalty to jointly optimize answer correctness and efficiency. Experiments show that our approach reduces the average reasoning tokens by 42\% while maintaining or improving accuracy.