

BEAVER: A Training-Free Hierarchical Prompt Compression Method via Structure-Aware Page Selection

March 20, 2026
Authors: Zhengpei Hu, Kai Li, Dapeng Fu, Chang Zeng, Yue Li, Yuanhao Tang, Jianqiang Huang
cs.AI

Abstract

The exponential expansion of context windows in LLMs has unlocked capabilities for long-document understanding but introduced severe bottlenecks in inference latency and information utilization. Existing compression methods often suffer from high training costs or semantic fragmentation due to aggressive token pruning. In this paper, we propose BEAVER, a novel training-free framework that shifts compression from linear token removal to structure-aware hierarchical selection. BEAVER maximizes hardware parallelism by mapping variable-length contexts into dense page-level tensors via dual-path pooling, and preserves discourse integrity through a hybrid planner combining semantic and lexical dual-branch selection with sentence smoothing. Extensive evaluations on four long-context benchmarks demonstrate that BEAVER achieves comparable performance to state-of-the-art (SOTA) methods like LongLLMLingua. Notably, on the RULER benchmark, BEAVER maintains high fidelity in multi-needle retrieval where baselines deteriorate. Regarding efficiency, BEAVER reduces latency by 26.4x on 128k contexts, offering a scalable solution for high-throughput applications. Our code is available at https://cslikai.cn/BEAVER/.
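The abstract does not give BEAVER's exact algorithm, but the pipeline it describes (paginate the context, pool each page into a dense vector via two paths, score pages with a hybrid semantic-plus-lexical planner, keep the top pages) can be illustrated with a minimal toy sketch. Everything below is an assumption for illustration: hashed bag-of-words vectors stand in for real model embeddings, the function names (`paginate`, `dual_path_pool`, `select_pages`) and the blending weight `alpha` are hypothetical, and sentence smoothing is omitted.

```python
import math

def paginate(tokens, page_size):
    """Split a token sequence into fixed-size pages (dense page-level layout)."""
    return [tokens[i:i + page_size] for i in range(0, len(tokens), page_size)]

def dual_path_pool(page, dim=64):
    """Toy dual-path pooling: concatenate mean- and max-pooled
    hashed bag-of-words features (a stand-in for real embeddings)."""
    vecs = []
    for tok in page:
        v = [0.0] * dim
        v[hash(tok) % dim] = 1.0
        vecs.append(v)
    mean = [sum(col) / len(vecs) for col in zip(*vecs)]
    mx = [max(col) for col in zip(*vecs)]
    return mean + mx  # two pooling paths, concatenated

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def lexical_score(page, query_tokens):
    """Lexical branch: fraction of query tokens appearing on the page."""
    page_set = set(page)
    return sum(t in page_set for t in query_tokens) / len(query_tokens)

def select_pages(tokens, query_tokens, page_size=8, budget=2, alpha=0.5):
    """Hybrid planner: blend semantic and lexical scores per page,
    then keep the top-`budget` pages in original document order."""
    pages = paginate(tokens, page_size)
    q_vec = dual_path_pool(query_tokens)
    scored = []
    for idx, page in enumerate(pages):
        sem = cosine(dual_path_pool(page), q_vec)
        lex = lexical_score(page, query_tokens)
        scored.append((alpha * sem + (1 - alpha) * lex, idx))
    keep = sorted(idx for _, idx in sorted(scored, reverse=True)[:budget])
    return [pages[i] for i in keep]

# A needle buried in filler: the page containing it should survive compression.
tokens = ["a"] * 8 + ["needle", "answer"] + ["b"] * 6 + ["c"] * 8
selected = select_pages(tokens, ["needle"], page_size=8, budget=2)
```

Selecting whole pages rather than individual tokens is what the abstract frames as the shift away from "linear token removal": contiguous spans survive intact, which is why a smoothing step at page boundaries suffices to preserve discourse integrity.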
March 24, 2026