Compressing Chain-of-Thought in LLMs via Step Entropy
August 5, 2025
Authors: Zeju Li, Jianyuan Zhong, Ziyang Zheng, Xiangyu Wen, Zhijian Xu, Yingying Cheng, Fan Zhang, Qiang Xu
cs.AI
Abstract
Large Language Models (LLMs) using Chain-of-Thought (CoT) prompting excel at
complex reasoning but generate verbose thought processes with considerable
redundancy, leading to increased inference costs and reduced efficiency. We
introduce a novel CoT compression framework based on step entropy, a metric
that quantifies the informational contribution of individual reasoning steps to
identify redundancy. Through theoretical analysis and extensive empirical
validation on mathematical reasoning benchmarks, we demonstrate that steps with
low entropy are indeed highly redundant. Our experiments reveal that an
astonishing 80% of low-entropy intermediate steps can be pruned with only minor
degradation in final answer accuracy across DeepSeek-R1-7B, 14B, and
Qwen3-8B. This finding sharply contrasts with random or high-entropy pruning,
which severely impairs reasoning performance. Building on this, we propose a
novel two-stage training strategy combining Supervised Fine-Tuning (SFT) and
Group Relative Policy Optimization (GRPO) reinforcement learning. This approach
enables LLMs to autonomously learn to generate compressed CoTs during inference
by strategically incorporating [SKIP] tokens. Our method significantly enhances
LLM inference efficiency while rigorously preserving accuracy, offering
profound implications for practical LLM deployment and a deeper understanding
of reasoning structures.
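
Below is a minimal, hedged Python sketch of the core pruning idea described in the abstract, not the authors' implementation: it assumes step entropy is computed as the mean token-level entropy within a reasoning step, and it replaces the lowest-entropy fraction (here 80%) of intermediate steps with a [SKIP] marker, roughly as one might construct compressed CoT targets for an SFT stage. The function names (`step_entropy`, `prune_low_entropy_steps`) and the per-token probability inputs are illustrative assumptions.

```python
import math
from typing import List

# Hedged sketch, not the paper's released code. One plausible reading of
# "step entropy" is the aggregated token-level entropy of a step under the
# model's predictive distribution; names below are illustrative assumptions.

def step_entropy(token_probs: List[List[float]]) -> float:
    """Mean token entropy H = -sum_v p(v) log p(v) over a step's tokens."""
    if not token_probs:
        return 0.0
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0)
        for dist in token_probs
    ]
    return sum(entropies) / len(entropies)

def prune_low_entropy_steps(steps: List[str],
                            step_token_probs: List[List[List[float]]],
                            prune_ratio: float = 0.8,
                            skip_token: str = "[SKIP]") -> List[str]:
    """Replace the lowest-entropy `prune_ratio` fraction of intermediate
    steps with a [SKIP] marker, keeping higher-entropy steps verbatim."""
    scores = [step_entropy(tp) for tp in step_token_probs]
    k = int(len(steps) * prune_ratio)
    prune_idx = set(sorted(range(len(steps)), key=lambda i: scores[i])[:k])
    return [skip_token if i in prune_idx else s for i, s in enumerate(steps)]
```

In practice the per-token distributions would come from the model's own logits over its generated CoT, and the retained high-entropy steps interleaved with [SKIP] markers would serve as compressed supervision; the GRPO stage described above would then refine when the model chooses to skip at inference time.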