LongCodeZip: Compress Long Context for Code Language Models

Abstract

Code generation under long contexts is becoming increasingly critical as Large Language Models (LLMs) are required to reason over extensive information in the codebase. While recent advances enable code LLMs to process long inputs, high API costs and generation latency remain substantial bottlenecks. Existing context pruning techniques, such as LLMLingua, achieve promising results for general text but overlook code-specific structures and dependencies, leading to suboptimal performance in programming tasks. In this paper, we propose LongCodeZip, a novel plug-and-play code compression framework designed specifically for code LLMs. LongCodeZip employs a dual-stage strategy: (1) coarse-grained compression, which identifies and ranks function-level chunks using conditional perplexity with respect to the instruction, retaining only the most relevant functions; and (2) fine-grained compression, which segments retained functions into blocks based on perplexity and selects an optimal subset under an adaptive token budget to maximize relevance. Evaluations across multiple tasks, including code completion, summarization, and question answering, show that LongCodeZip consistently outperforms baseline methods, achieving up to a 5.6x compression ratio without degrading task performance. By effectively reducing context size while preserving essential information, LongCodeZip enables LLMs to better scale to real-world, large-scale code scenarios, advancing the efficiency and capability of code intelligence applications.
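The dual-stage strategy described above can be illustrated with a minimal sketch. This is not the authors' implementation, and it makes several loudly simplified assumptions: `toy_conditional_perplexity` is a hypothetical stand-in that scores the instruction under a smoothed unigram model fitted on the context (a real system would query a code LLM), the tokenizer is plain whitespace splitting, fine-grained blocks are simply non-empty lines rather than perplexity-based segments, and both token budgets are fixed rather than adaptive. All function names and parameters are illustrative.

```python
import math
from collections import Counter


def tokenize(text):
    """Whitespace tokenizer; an assumed stand-in for a real LLM tokenizer."""
    return text.split()


def toy_conditional_perplexity(instruction, context):
    """Perplexity of the instruction under an add-one-smoothed unigram model
    fitted on the context. Toy stand-in for LLM conditional perplexity."""
    ctx = Counter(tokenize(context))
    query = tokenize(instruction)
    vocab = len(set(ctx) | set(query))
    total = sum(ctx.values())
    log_prob = sum(math.log((ctx[t] + 1) / (total + vocab)) for t in query)
    return math.exp(-log_prob / max(len(query), 1))


def coarse_select(functions, instruction, coarse_budget):
    """Stage 1: rank functions by ascending perplexity (lower means more
    relevant) and greedily keep those that still fit the token budget."""
    ranked = sorted(
        functions, key=lambda f: toy_conditional_perplexity(instruction, f)
    )
    kept, used = [], 0
    for func in ranked:
        cost = len(tokenize(func))
        if used + cost <= coarse_budget:
            kept.append(func)
            used += cost
    return kept


def knapsack_select(blocks, values, budget):
    """Stage 2: 0/1 knapsack. Return the sorted indices of the blocks that
    maximize total value while total token count stays within the budget."""
    weights = [len(tokenize(b)) for b in blocks]
    # best[cap] = (best value, chosen indices) using at most `cap` tokens
    best = [(0.0, ())] * (budget + 1)
    for i, (w, v) in enumerate(zip(weights, values)):
        # Iterate capacities downward so each block is used at most once.
        for cap in range(budget, w - 1, -1):
            candidate = best[cap - w][0] + v
            if candidate > best[cap][0]:
                best[cap] = (candidate, best[cap - w][1] + (i,))
    return sorted(best[budget][1])


def compress(functions, instruction, coarse_budget, fine_budget):
    """Run both stages and return the compressed context as one string."""
    kept = coarse_select(functions, instruction, coarse_budget)
    # Assumption: one block per non-empty line instead of perplexity-based
    # segmentation.
    blocks = [line for f in kept for line in f.splitlines() if line.strip()]
    # Relevance is taken as inverse perplexity so that knapsack values are
    # positive and higher means more relevant.
    values = [
        1.0 / toy_conditional_perplexity(instruction, b) for b in blocks
    ]
    chosen = knapsack_select(blocks, values, fine_budget)
    return "\n".join(blocks[i] for i in chosen)
```

Calling `compress(functions, instruction, coarse_budget, fine_budget)` with a list of function source strings returns the selected lines in their original order. Swapping the toy scorer for perplexities computed by an actual code LLM would bring the sketch closer to the approach the abstract describes, although the selection logic here remains an approximation.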