Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning
May 20, 2025
Authors: Jiwon Song, Dongwon Jo, Yulhwa Kim, Jae-Joon Kim
cs.AI
Abstract
Recent reasoning-focused language models achieve high accuracy by generating lengthy intermediate reasoning paths before producing final answers. While this approach is effective for solving problems that require logical thinking, long reasoning paths significantly increase memory usage and reduce token-generation throughput, limiting the practical deployment of such models. We propose Reasoning Path Compression (RPC), a training-free method that accelerates inference by leveraging the semantic sparsity of reasoning paths. RPC periodically compresses the key-value (KV) cache, retaining only the entries that receive high importance scores, which are computed using a selector window composed of recently generated queries. Experiments show that RPC improves the generation throughput of QwQ-32B by up to 1.60× compared to inference with the full KV cache, with an accuracy drop of only 1.2% on the AIME 2024 benchmark. Our findings demonstrate that semantic sparsity in reasoning traces can be effectively exploited for compression, offering a practical path toward efficient deployment of reasoning LLMs. Our code is available at https://github.com/jiwonsong-dev/ReasoningPathCompression.
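To make the selector-window mechanism concrete, below is a minimal PyTorch sketch of the scoring-and-pruning step the abstract describes. The function name compress_kv_cache, the hyperparameters selector_window and keep_ratio, and the choice to always retain the selector window itself are illustrative assumptions for this sketch, not details taken from the paper or its repository.

```python
# Minimal sketch of RPC-style KV cache compression (assumption: the actual
# implementation in the linked repo may differ in scoring and scheduling).
# Each cached position is scored by the attention it receives from a
# "selector window" of the most recently generated queries; only the
# top-scoring entries are retained.

import torch

def compress_kv_cache(keys, values, queries, selector_window=32, keep_ratio=0.25):
    """
    keys, values: (seq_len, d) cached K/V for one attention head.
    queries:      (seq_len, d) queries generated so far.
    Returns compressed (keys, values) keeping the highest-scoring entries.
    Hyperparameter values here are illustrative, not from the paper.
    """
    seq_len, d = keys.shape
    keep = max(1, int(seq_len * keep_ratio))
    if seq_len <= keep:
        return keys, values

    # Selector window: the last few queries vote on which KV entries matter.
    sel_q = queries[-selector_window:]                      # (w, d)
    scores = (sel_q @ keys.T) / d**0.5                      # (w, seq_len)
    scores = torch.softmax(scores, dim=-1).sum(dim=0)       # aggregate votes

    # Illustrative choice: always keep the selector window itself so the
    # most recent local context survives compression.
    scores[-selector_window:] = float("inf")

    idx = torch.topk(scores, k=keep).indices.sort().values  # preserve order
    return keys[idx], values[idx]

if __name__ == "__main__":
    torch.manual_seed(0)
    k, v, q = (torch.randn(1024, 64) for _ in range(3))
    ck, cv = compress_kv_cache(k, v, q)
    print(ck.shape, cv.shape)  # compressed to 256 of 1024 cached positions
```

Because the cache is pruned periodically rather than once, the per-step attention cost stays roughly proportional to the compressed cache size, which is what yields the reported throughput gain.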