Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning

May 20, 2025
Authors: Jiwon Song, Dongwon Jo, Yulhwa Kim, Jae-Joon Kim
cs.AI

Abstract

Recent reasoning-focused language models achieve high accuracy by generating lengthy intermediate reasoning paths before producing final answers. While this approach is effective in solving problems that require logical thinking, long reasoning paths significantly increase memory usage and reduce the throughput of token generation, limiting the practical deployment of such models. We propose Reasoning Path Compression (RPC), a training-free method that accelerates inference by leveraging the semantic sparsity of reasoning paths. RPC periodically compresses the KV cache by retaining entries that receive high importance scores, which are computed using a selector window composed of recently generated queries. Experiments show that RPC improves the generation throughput of QwQ-32B by up to 1.60× compared to inference with a full KV cache, with an accuracy drop of only 1.2% on the AIME 2024 benchmark. Our findings demonstrate that semantic sparsity in reasoning traces can be effectively exploited for compression, offering a practical path toward efficient deployment of reasoning LLMs. Our code is available at https://github.com/jiwonsong-dev/ReasoningPathCompression.
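
To make the mechanism concrete, below is a minimal sketch of a single RPC-style compression step as described in the abstract: importance scores for cached KV entries are derived from attention between a selector window of recently generated queries and the cached keys, and only the top-scoring entries are retained. All names and defaults here (compress_kv_cache, window_size, keep_ratio) are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
# Illustrative sketch of one RPC-style KV cache compression step for a
# single attention head. Function name and parameter values are hypothetical;
# in the paper, compression is applied periodically during decoding.
import torch

def compress_kv_cache(keys: torch.Tensor,
                      values: torch.Tensor,
                      queries: torch.Tensor,
                      window_size: int = 32,
                      keep_ratio: float = 0.25):
    """Retain KV entries that score high against recent queries.

    keys, values: [seq_len, d] cached key/value vectors
    queries:      [seq_len, d] queries generated so far
    """
    seq_len, d = keys.shape
    # Selector window: the most recently generated queries.
    selector = queries[-window_size:]                           # [w, d]
    # Importance of each cached entry = attention mass it receives from
    # the selector window (softmax over keys, summed over the window).
    attn = torch.softmax(selector @ keys.T / d ** 0.5, dim=-1)  # [w, seq_len]
    scores = attn.sum(dim=0)                                    # [seq_len]
    # Keep the top-scoring fraction, never dropping below the window size
    # and never exceeding the current cache length.
    num_keep = min(seq_len, max(int(seq_len * keep_ratio), window_size))
    keep_idx = torch.topk(scores, num_keep).indices.sort().values
    return keys[keep_idx], values[keep_idx]
```

In practice, this scoring would run per head and per layer, triggered every fixed number of generated tokens rather than once, so the cache stays bounded as the reasoning path grows.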
