Compressing Chain-of-Thought in LLMs via Step Entropy

August 5, 2025
Authors: Zeju Li, Jianyuan Zhong, Ziyang Zheng, Xiangyu Wen, Zhijian Xu, Yingying Cheng, Fan Zhang, Qiang Xu
cs.AI

Abstract

Large Language Models (LLMs) using Chain-of-Thought (CoT) prompting excel at complex reasoning but generate verbose thought processes with considerable redundancy, leading to increased inference costs and reduced efficiency. We introduce a novel CoT compression framework based on step entropy, a metric that quantifies the informational contribution of individual reasoning steps in order to identify redundancy. Through theoretical analysis and extensive empirical validation on mathematical reasoning benchmarks, we demonstrate that steps with low entropy are indeed highly redundant. Our experiments reveal that an astonishing 80% of low-entropy intermediate steps can be pruned with only minor degradation in final answer accuracy across DeepSeek-R1-7B, 14B, and Qwen3-8B. This finding sharply contrasts with random or high-entropy pruning, which severely impairs reasoning performance. Building on this, we propose a novel two-stage training strategy combining Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) reinforcement learning. This approach enables LLMs to autonomously learn to generate compressed CoTs during inference by strategically incorporating [SKIP] tokens. Our method significantly enhances LLM inference efficiency while rigorously preserving accuracy, offering profound implications for practical LLM deployment and a deeper understanding of reasoning structures.
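The abstract does not spell out how step entropy is computed, so the following is only a minimal sketch of the idea, assuming that a step's entropy is the aggregated Shannon entropy of the model's next-token distributions over that step's tokens, and that the lowest-entropy steps are replaced by a [SKIP] marker. The function names, the summation-based aggregation, and the 80% prune ratio used as a default are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def token_entropy(prob_dist):
    """Shannon entropy (in nats) of a single next-token probability distribution."""
    p = np.asarray(prob_dist, dtype=np.float64)
    p = p[p > 0]  # ignore zero-probability tokens to avoid log(0)
    return float(-(p * np.log(p)).sum())

def step_entropy(step_token_dists):
    """Entropy of one reasoning step.

    Assumption: a step's entropy is the sum of the entropies of the
    next-token distributions for each token the model generated in that step.
    """
    return sum(token_entropy(d) for d in step_token_dists)

def prune_low_entropy_steps(steps, step_token_dists, prune_ratio=0.8):
    """Replace the lowest-entropy fraction of reasoning steps with a [SKIP] marker.

    `steps` is a list of step strings; `step_token_dists` holds, for each step,
    the per-token probability distributions produced while generating it.
    """
    entropies = [step_entropy(d) for d in step_token_dists]
    k = int(len(steps) * prune_ratio)
    prune_idx = set(np.argsort(entropies)[:k])  # indices of the k lowest-entropy steps
    return ["[SKIP]" if i in prune_idx else s for i, s in enumerate(steps)]
```

In the paper's pipeline, compressed traces of this form would then serve as SFT targets, with GRPO further training the model to emit [SKIP] tokens on its own at inference time; the sketch above only covers the offline pruning criterion.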