Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning
May 20, 2025
Authors: Jiwon Song, Dongwon Jo, Yulhwa Kim, Jae-Joon Kim
cs.AI
Abstract
Recent reasoning-focused language models achieve high accuracy by generating lengthy intermediate reasoning paths before producing final answers. While this approach is effective for solving problems that require logical thinking, long reasoning paths significantly increase memory usage and reduce token-generation throughput, limiting the practical deployment of such models. We propose Reasoning Path Compression (RPC), a training-free method that accelerates inference by leveraging the semantic sparsity of reasoning paths. RPC periodically compresses the key-value (KV) cache, retaining only the entries that receive high importance scores, which are computed using a selector window composed of recently generated queries. Experiments show that RPC improves the generation throughput of QwQ-32B by up to 1.60× compared to inference with the full KV cache, with an accuracy drop of only 1.2% on the AIME 2024 benchmark. Our findings demonstrate that semantic sparsity in reasoning traces can be effectively exploited for compression, offering a practical path toward efficient deployment of reasoning LLMs. Our code is available at https://github.com/jiwonsong-dev/ReasoningPathCompression.
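To make the selector-window mechanism concrete, below is a minimal PyTorch sketch of the scoring-and-pruning step the abstract describes. The function name compress_kv_cache, the hyperparameters selector_window and keep_ratio, and the choice to always retain the selector window itself are illustrative assumptions for this sketch, not details taken from the paper or its repository.

```python
# Minimal sketch of RPC-style KV cache compression (assumption: the actual
# implementation in the linked repo may differ in scoring and scheduling).
# Each cached position is scored by the attention it receives from a
# "selector window" of the most recently generated queries; only the
# top-scoring entries are retained.

import torch

def compress_kv_cache(keys, values, queries, selector_window=32, keep_ratio=0.25):
    """
    keys, values: (seq_len, d) cached K/V for one attention head.
    queries:      (seq_len, d) queries generated so far.
    Returns compressed (keys, values) keeping the highest-scoring entries.
    Hyperparameter values here are illustrative, not from the paper.
    """
    seq_len, d = keys.shape
    keep = max(1, int(seq_len * keep_ratio))
    if seq_len <= keep:
        return keys, values

    # Selector window: the last few queries vote on which KV entries matter.
    sel_q = queries[-selector_window:]                      # (w, d)
    scores = (sel_q @ keys.T) / d**0.5                      # (w, seq_len)
    scores = torch.softmax(scores, dim=-1).sum(dim=0)       # aggregate votes

    # Illustrative choice: always keep the selector window itself so the
    # most recent local context survives compression.
    scores[-selector_window:] = float("inf")

    idx = torch.topk(scores, k=keep).indices.sort().values  # preserve order
    return keys[idx], values[idx]

if __name__ == "__main__":
    torch.manual_seed(0)
    k, v, q = (torch.randn(1024, 64) for _ in range(3))
    ck, cv = compress_kv_cache(k, v, q)
    print(ck.shape, cv.shape)  # compressed to 256 of 1024 cached positions
```

Because the cache is pruned periodically rather than once, the per-step attention cost stays roughly proportional to the compressed cache size, which is what yields the reported throughput gain.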