Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning

Abstract

Recent reasoning-focused language models achieve high accuracy by generating lengthy intermediate reasoning paths before producing final answers. While this approach is effective for problems that require logical thinking, long reasoning paths significantly increase memory usage and reduce token-generation throughput, limiting the practical deployment of such models. We propose Reasoning Path Compression (RPC), a training-free method that accelerates inference by leveraging the semantic sparsity of reasoning paths. RPC periodically compresses the KV cache by retaining the KV cache entries that receive high importance scores, which are computed using a selector window composed of recently generated queries. Experiments show that RPC improves the generation throughput of QwQ-32B by up to 1.60× compared to inference with the full KV cache, with an accuracy drop of 1.2% on the AIME 2024 benchmark. Our findings demonstrate that semantic sparsity in reasoning traces can be effectively exploited for compression, offering a practical path toward efficient deployment of reasoning LLMs. Our code is available at https://github.com/jiwonsong-dev/ReasoningPathCompression.
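To make the compression step concrete, the following is a minimal, single-head sketch of periodic KV cache compression driven by a selector window of recent queries. It is an illustration under stated assumptions, not the released implementation: the abstract does not specify how importance scores are aggregated, so this sketch assumes each cached entry is scored by the total softmax attention it receives from the window queries. The names `importance_scores`, `compress_kv`, `decode_with_compression`, `period`, `keep`, and `window` are hypothetical.

```python
import math


def importance_scores(window_queries, keys):
    """Score each cached key by the softmax attention it receives,
    summed over the selector-window queries (an assumed aggregation)."""
    dim = len(keys[0])
    scores = [0.0] * len(keys)
    for q in window_queries:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        for i, e in enumerate(exps):
            scores[i] += e / total
    return scores


def compress_kv(keys, values, window_queries, keep):
    """Retain the `keep` highest-scoring cache entries in their original order."""
    if keep >= len(keys):
        return list(keys), list(values), list(range(len(keys)))
    scores = importance_scores(window_queries, keys)
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])  # restore positional order
    return [keys[i] for i in kept], [values[i] for i in kept], kept


def decode_with_compression(stream, period, keep, window):
    """Simulate decoding: `stream` yields one (query, key, value) per token.
    Every `period` tokens the cache is compressed to `keep` entries, using
    the last `window` queries as the selector window (hypothetical knobs)."""
    keys, values, recent_queries = [], [], []
    for step, (q, k, v) in enumerate(stream, start=1):
        keys.append(k)
        values.append(v)
        recent_queries = (recent_queries + [q])[-window:]
        if step % period == 0:
            keys, values, _ = compress_kv(keys, values, recent_queries, keep)
    return keys, values
```

In this sketch the cache never holds more than `keep + period - 1` entries between compression steps, which is how a bounded cache would translate into lower memory use and higher throughput. A real system would apply the selection per layer and per attention head on tensors rather than on Python lists.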
