Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning

Abstract

Recent reasoning-focused language models achieve high accuracy by generating lengthy intermediate reasoning paths before producing final answers. While this approach is effective for problems that require logical thinking, long reasoning paths significantly increase memory usage and reduce token-generation throughput, limiting the practical deployment of such models. We propose Reasoning Path Compression (RPC), a training-free method that accelerates inference by leveraging the semantic sparsity of reasoning paths. RPC periodically compresses the KV cache by retaining the cache entries that receive high importance scores, which are computed using a selector window composed of recently generated queries. Experiments show that RPC improves the generation throughput of QwQ-32B by up to 1.60× compared to inference with the full KV cache, with an accuracy drop of 1.2% on the AIME 2024 benchmark. Our findings demonstrate that semantic sparsity in reasoning traces can be effectively exploited for compression, offering a practical path toward efficient deployment of reasoning LLMs. Our code is available at https://github.com/jiwonsong-dev/ReasoningPathCompression.
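
The selection step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the repository above for that). It assumes a single attention head, scaled dot-product softmax attention as the importance signal, and scores averaged over the selector-window queries. The names `importance_scores`, `compress_kv`, and `keep` are hypothetical and chosen only for this example.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def importance_scores(selector_queries, keys):
    # Assumption: importance of a cached key is the average attention weight
    # it receives from the recently generated queries in the selector window.
    scale = 1.0 / math.sqrt(len(keys[0]))
    scores = [0.0] * len(keys)
    for q in selector_queries:
        logits = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        for i, w in enumerate(softmax(logits)):
            scores[i] += w / len(selector_queries)
    return scores

def compress_kv(keys, values, selector_queries, keep):
    # Retain the `keep` highest-scoring entries, preserving original order.
    scores = importance_scores(selector_queries, keys)
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:keep]
    kept = sorted(top)
    return [keys[i] for i in kept], [values[i] for i in kept], kept

# Toy cache with four entries and a selector window of two recent queries.
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
values = [[10.0], [20.0], [30.0], [40.0]]
selector_queries = [[2.0, 0.0], [1.5, 0.5]]
new_keys, new_values, kept = compress_kv(keys, values, selector_queries, keep=2)
print(kept)  # [0, 2]
```

In a decoding loop, a routine like this would be invoked periodically, for example once every fixed number of generated tokens, so that the cache stays bounded while the entries the recent queries attend to most are preserved. The period and the number of retained entries are tunable assumptions here, not values taken from the abstract.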
