BEAVER: A Training-Free Hierarchical Prompt Compression Method via Structure-Aware Page Selection

Abstract

The exponential expansion of context windows in large language models (LLMs) has unlocked long-document understanding but introduced severe bottlenecks in inference latency and information utilization. Existing compression methods often suffer from high training costs or from semantic fragmentation caused by aggressive token pruning. In this paper, we propose BEAVER, a novel training-free framework that shifts compression from linear token removal to structure-aware hierarchical selection. BEAVER maximizes hardware parallelism by mapping variable-length contexts into dense page-level tensors via dual-path pooling, and it preserves discourse integrity through a hybrid planner that combines semantic and lexical dual-branch selection with sentence smoothing. Extensive evaluations on four long-context benchmarks demonstrate that BEAVER achieves performance comparable to state-of-the-art (SOTA) methods such as LongLLMLingua. Notably, on the RULER benchmark, BEAVER maintains high fidelity in multi-needle retrieval where baselines deteriorate. Regarding efficiency, BEAVER reduces latency by 26.4x on 128k contexts, offering a scalable solution for high-throughput applications. Our code is available at https://cslikai.cn/BEAVER/.
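To make the pipeline described above more concrete, the following is a minimal sketch of page-level selection with sentence smoothing. It is not the BEAVER implementation, and the abstract does not specify the details, so nearly everything here is an assumption made for illustration: the CRC32-based toy embedding, the use of mean and max pooling as the two pooling paths, the weighted sum `alpha` that fuses the semantic and lexical branches, whitespace tokenization, and the names `dual_path_pool`, `lexical_overlap`, `smooth_to_sentences`, and `compress`.

```python
import math
import zlib

DIM = 32  # width of the toy embedding (assumption, not from the paper)

def norm(token):
    """Lowercase a token and strip trailing punctuation."""
    return token.lower().strip(".,?!")

def embed(token):
    """Toy deterministic embedding: one-hot bucket chosen by a CRC32 hash."""
    vec = [0.0] * DIM
    vec[zlib.crc32(norm(token).encode("utf-8")) % DIM] = 1.0
    return vec

def dual_path_pool(tokens):
    """Concatenate mean-pooled and max-pooled embeddings (assumed two paths)."""
    embs = [embed(t) for t in tokens]
    mean = [sum(col) / len(embs) for col in zip(*embs)]
    peak = [max(col) for col in zip(*embs)]
    return mean + peak

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def lexical_overlap(query_tokens, page_tokens):
    """Fraction of distinct query terms that also appear in the page."""
    q = {norm(t) for t in query_tokens}
    p = {norm(t) for t in page_tokens}
    return len(q & p) / len(q) if q else 0.0

def smooth_to_sentences(tokens, start, end):
    """Widen [start, end) so it starts and ends on a sentence boundary."""
    while start > 0 and not tokens[start - 1].endswith("."):
        start -= 1
    while end < len(tokens) and not tokens[end - 1].endswith("."):
        end += 1
    return start, end

def compress(context, query, page_size=8, keep_pages=1, alpha=0.5):
    """Keep the top-scoring fixed-size pages, expanded to whole sentences."""
    tokens = context.split()
    q_tokens = query.split()
    q_vec = dual_path_pool(q_tokens)
    scored = []
    for start in range(0, len(tokens), page_size):
        end = min(start + page_size, len(tokens))
        page = tokens[start:end]
        semantic = cosine(q_vec, dual_path_pool(page))
        lexical = lexical_overlap(q_tokens, page)
        scored.append((alpha * semantic + (1 - alpha) * lexical, start, end))
    top = sorted(scored, reverse=True)[:keep_pages]
    spans = sorted(smooth_to_sentences(tokens, s, e) for _, s, e in top)
    merged = []  # merge overlapping or touching spans to avoid duplicated text
    for s, e in spans:
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return " ".join(" ".join(tokens[s:e]) for s, e in merged)
```

Calling `compress(context, query)` returns a shortened context made of whole sentences. Fixed-size pages are used in this sketch because they yield equally shaped per-page vectors, which is one plausible reading of "dense page-level tensors"; a real system would use a learned encoder rather than a hash and would batch the pooling on an accelerator.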