BEAVER: A Training-Free Hierarchical Prompt Compression Method via Structure-Aware Page Selection

Abstract

The exponential expansion of context windows in LLMs has unlocked capabilities for long-document understanding, but it has also introduced severe bottlenecks in inference latency and information utilization. Existing compression methods often suffer from high training costs or from semantic fragmentation caused by aggressive token pruning. In this paper, we propose BEAVER, a novel training-free framework that shifts compression from linear token removal to structure-aware hierarchical selection. BEAVER maximizes hardware parallelism by mapping variable-length contexts into dense page-level tensors via dual-path pooling, and preserves discourse integrity through a hybrid planner that combines semantic and lexical dual-branch selection with sentence smoothing. Extensive evaluations on four long-context benchmarks demonstrate that BEAVER achieves performance comparable to state-of-the-art (SOTA) methods such as LongLLMLingua. Notably, on the RULER benchmark, BEAVER maintains high fidelity in multi-needle retrieval, where baselines deteriorate. Regarding efficiency, BEAVER reduces latency by 26.4x on 128k contexts, offering a scalable solution for high-throughput applications. Our code is available at https://cslikai.cn/BEAVER/.
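
To make the idea of page-level selection concrete, the following is a minimal toy sketch and not the BEAVER implementation (see the project page for the actual code). It rests on several assumptions that the abstract does not confirm: pages are fixed-size token chunks, "dual-path pooling" is read as concatenated mean and max pooling, the semantic branch is cosine similarity to a pooled query vector, the lexical branch is query-token overlap, and the two branches are merged by taking the union of their top-k pages. The `embed` function is a hypothetical stand-in for a real encoder, and sentence smoothing is omitted.

```python
import math
import zlib

DIM = 16  # toy embedding width (assumption, chosen only for illustration)


def embed(token):
    """Hypothetical stand-in for a real encoder: a one-hot vector keyed by CRC32."""
    vec = [0.0] * DIM
    vec[zlib.crc32(token.lower().encode("utf-8")) % DIM] = 1.0
    return vec


def paginate(tokens, page_size):
    """Split the token list into consecutive fixed-size pages (last page may be shorter)."""
    return [tokens[i:i + page_size] for i in range(0, len(tokens), page_size)]


def pool(page):
    """Assumed reading of dual-path pooling: concatenate mean- and max-pooled vectors."""
    vecs = [embed(t) for t in page]
    mean_path = [sum(col) / len(vecs) for col in zip(*vecs)]
    max_path = [max(col) for col in zip(*vecs)]
    return mean_path + max_path


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_pages(tokens, query_tokens, page_size=4, k=1):
    """Keep the union of the top-k pages from a semantic and a lexical branch."""
    pages = paginate(tokens, page_size)
    query_vec = pool(query_tokens)
    query_set = {t.lower() for t in query_tokens}

    # Semantic branch: similarity between pooled page and pooled query.
    semantic = [cosine(pool(p), query_vec) for p in pages]
    # Lexical branch: fraction of distinct query tokens that occur in the page.
    lexical = [
        len(query_set & {t.lower() for t in p}) / max(len(query_set), 1)
        for p in pages
    ]

    def top_k(scores):
        return sorted(range(len(pages)), key=lambda i: -scores[i])[:k]

    # Union of both branches, restored to document order.
    keep = sorted(set(top_k(semantic)) | set(top_k(lexical)))
    return keep, [pages[i] for i in keep]


if __name__ == "__main__":
    text = "the beaver builds a dam across the river while the fox sleeps in the den"
    kept_ids, kept_pages = select_pages(text.split(), ["fox", "den"], page_size=5, k=1)
    print(kept_ids, kept_pages)
```

Selecting whole pages rather than individual tokens keeps each retained span contiguous, which is the property the abstract contrasts with token-level pruning. The union rule and the equal treatment of both branches are illustrative choices only.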
