Pruning the Unsurprising: Efficient Code Reasoning via First-Token Surprisal
August 8, 2025
Authors: Wenhao Zeng, Yaoning Wang, Chao Hu, Yuling Shi, Chengcheng Wan, Hongyu Zhang, Xiaodong Gu
cs.AI
Abstract
Recently, Large Reasoning Models (LRMs) have demonstrated remarkable
capabilities in code reasoning by scaling up the length of Chain-of-Thought
(CoT). However, excessively long reasoning traces introduce substantial
challenges in terms of training cost, inference latency, and deployment
feasibility. While various CoT compression approaches have emerged to address
this challenge, they face inherent trade-offs: token-level methods often
disrupt syntactic and logical coherence, while step-level methods based on
perplexity fail to reliably capture the logically critical reasoning steps. In
this paper, we propose ASAP (Anchor-guided, Surprisal-based Pruning), a novel
coarse-to-fine framework for CoT compression. ASAP first performs anchor-guided
pruning to preserve the core reasoning structure, which efficiently reduces the
search space for subsequent processing. It then performs logic-aware pruning
by selecting logically essential reasoning steps based on a novel first-token
surprisal metric. Finally, ASAP teaches models to autonomously generate and
leverage these concise CoTs at inference time, enabling efficient reasoning in
coding tasks. Experiments show that ASAP achieves state-of-the-art accuracy
across multiple code generation benchmarks while substantially reducing
training and inference costs. On the challenging LiveCodeBench v4_v5 benchmark,
our approach reduces token generation by 23.5% and inference latency by 43.5%
compared to the strongest baseline, while achieving a competitive accuracy of
36.19% in Pass@1. Our results highlight a promising direction for building
powerful and efficient LRMs.
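
For intuition, the following is a minimal sketch, not the authors' implementation, of how step-level scoring by first-token surprisal might look with a Hugging Face causal LM. The model name, the keep ratio, the way steps are delimited, and the convention that the highest-surprisal (least predictable) steps are retained are all illustrative assumptions.

```python
# Sketch only: score each reasoning step by the surprisal of its first token
# given the preceding context, then keep the highest-scoring steps.
# Model name, keep_ratio, and step handling are assumptions, not ASAP's code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-1.5B"  # assumption: any causal LM could be used
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def first_token_surprisal(context: str, step: str) -> float:
    """Return -log p(first token of `step` | context) under the LM."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    step_ids = tokenizer(step, add_special_tokens=False).input_ids
    with torch.no_grad():
        logits = model(ctx_ids).logits[0, -1]  # next-token logits at the end of context
    log_probs = torch.log_softmax(logits, dim=-1)
    return -log_probs[step_ids[0]].item()


def prune_steps(prompt: str, steps: list[str], keep_ratio: float = 0.5) -> list[str]:
    """Keep the highest-surprisal reasoning steps (assumed proxy for logical necessity)."""
    scores = []
    context = prompt
    for step in steps:
        scores.append(first_token_surprisal(context, step))
        context += "\n" + step  # each step is scored against the growing context
    k = max(1, int(len(steps) * keep_ratio))
    keep = set(sorted(range(len(steps)), key=lambda i: -scores[i])[:k])
    return [s for i, s in enumerate(steps) if i in keep]
```

In the full framework, a fine-grained pass of this kind would follow the coarse anchor-guided pruning stage described above, so the surprisal scoring only runs over the already-reduced set of candidate steps.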