Pruning the Unsurprising: Efficient Code Reasoning via First-Token Surprisal

Abstract

Recently, Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in code reasoning by scaling up the length of Chain-of-Thought (CoT). However, excessively long reasoning traces introduce substantial challenges in terms of training cost, inference latency, and deployment feasibility. While various CoT compression approaches have emerged to address this challenge, they face inherent trade-offs: token-level methods often disrupt syntactic and logical coherence, while step-level methods based on perplexity fail to reliably capture the logically critical reasoning steps. In this paper, we propose ASAP (Anchor-guided, Surprisal-based Pruning), a novel coarse-to-fine framework for CoT compression. ASAP first performs anchor-guided pruning to preserve the core reasoning structure, which efficiently reduces the search space for subsequent processing. It then performs logic-aware pruning, selecting the logically essential reasoning steps with a novel first-token surprisal metric. Finally, ASAP teaches models to autonomously generate and leverage these concise CoTs at inference time, enabling efficient reasoning in coding tasks. Experiments show that ASAP achieves state-of-the-art accuracy across multiple code generation benchmarks while substantially reducing training and inference costs. On the challenging LiveCodeBench v4_v5 benchmark, our approach reduces token generation by 23.5% and inference latency by 43.5% compared to the strongest baseline, while achieving a competitive Pass@1 accuracy of 36.19%. Our results highlight a promising direction for building powerful and efficient LRMs.
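The abstract does not define the first-token surprisal metric precisely. A minimal sketch of the fine-grained selection stage, assuming that a step's surprisal is the negative log-probability of its first token given the preceding context, and that the most surprising steps are kept (as the title "Pruning the Unsurprising" suggests), could look like this. It does not cover anchor-guided pruning or training. The names `prune_unsurprising`, `toy_logprob`, `keep_ratio`, and the fixed-ratio selection rule are hypothetical, and `toy_logprob` is only a placeholder for a real language model.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical scorer signature: log p(first token of `step` | `context`).
# A real implementation would query a causal language model; here it is injected.
LogProbFn = Callable[[str, str], float]


def first_token_surprisal(context: str, step: str, logprob_fn: LogProbFn) -> float:
    """Surprisal (nats) of a step's first token given everything before it."""
    return -logprob_fn(context, step)


def prune_unsurprising(steps: Sequence[str], logprob_fn: LogProbFn,
                       keep_ratio: float = 0.5, prompt: str = "") -> List[str]:
    """Keep the ceil(keep_ratio * n) steps with the highest first-token surprisal,
    returned in their original order. Ties go to the earlier step."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not steps:
        return []
    scores = []
    for i, step in enumerate(steps):
        # Condition on the problem statement (if any) plus all earlier steps.
        context = "\n".join(([prompt] if prompt else []) + list(steps[:i]))
        scores.append(first_token_surprisal(context, step, logprob_fn))
    k = math.ceil(keep_ratio * len(steps))
    ranked = sorted(range(len(steps)), key=lambda i: (-scores[i], i))
    return [steps[i] for i in sorted(ranked[:k])]


def _first_word(text: str) -> str:
    words = text.split()
    return words[0].strip(",.:;") if words else ""


def toy_logprob(context: str, step: str) -> float:
    """Toy stand-in for a language model (assumption, not the paper's scorer):
    an opener already used by an earlier step is very predictable, common
    filler openers are fairly predictable, anything else is surprising."""
    first = _first_word(step)
    earlier = {_first_word(line) for line in context.splitlines() if line.strip()}
    if first in earlier:
        p = 0.5
    elif first in {"So", "Wait", "Hmm"}:
        p = 0.3
    else:
        p = 0.02
    return math.log(p)


if __name__ == "__main__":
    steps = [
        "Use a hash map from value to index.",
        "So we only need one pass over the array.",
        "Hmm, maybe sorting would also work.",
        "Check whether target - x is already in the map.",
        "So the overall complexity is O(n).",
    ]
    # Prints steps 0, 1 and 3: the two high-surprisal steps plus the earlier
    # of the tied "So"/"Hmm" filler steps; the repeated "So" step is dropped.
    for kept in prune_unsurprising(steps, toy_logprob, keep_ratio=0.5):
        print(kept)
```

In practice `logprob_fn` would wrap a causal language model's next-token log-probabilities for the first token of each step; the toy scorer merely makes repeated and filler openers predictable so the selection logic can run without a model. A fixed keep ratio is one simple selection rule; a surprisal threshold would be an equally plausible alternative.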
