LongAttnComp: Cross-Family Context Compression for Long-Context Inference

Abstract

As real-world applications increasingly require processing inputs of 100k+ tokens, the gap between context length and inference efficiency has become a critical bottleneck. Context compression offers a way to reduce prefill costs while preserving task accuracy. However, existing training-free attention-based methods leave substantial gaps in demanding long-context tasks such as code reasoning. We present LongAttnComp, a long-context adaptation of AttnComp that fine-tunes a lightweight cross-attention scoring layer and introduces token-level chunking, a token-budget top-p algorithm, positional reordering, and a format-agnostic query parser. We further design a two-stage fine-tuning recipe for the compressor: Stage 1 builds a general retrieval foundation from NIAH-style data, and Stage 2 extends it with multi-hop and reasoning data for broader long-context task coverage. On InfiniteBench Code-Debug, LongAttnComp matches or exceeds full-context accuracy, substantially outperforms training-free baselines, and transfers across four target models from three families. On LongBench v2, the two-stage recipe largely closes the Stage 1 gap on multi-document reasoning while preserving Code-Debug performance.
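The abstract names token-level chunking, a token-budget top-p algorithm, and positional reordering without specifying them. The sketch below is one plausible reading, not the authors' implementation: the function names `chunk_tokens` and `select_chunks`, the fixed chunk size, the stopping rule, and the assumption that chunk scores are non-negative with a positive sum are all hypothetical choices made for illustration. In the described system the scores would come from the fine-tuned cross-attention scoring layer; here they are simply passed in.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[int], chunk_size: int) -> List[List[int]]:
    """Split a token sequence into consecutive fixed-size chunks (assumed scheme)."""
    return [list(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


def select_chunks(scores: Sequence[float], lengths: Sequence[int],
                  top_p: float, token_budget: int) -> List[int]:
    """Greedy top-p selection under a token budget (hypothetical rule).

    Chunks are visited in descending score order. Selection stops when the
    kept chunks cover at least `top_p` of the total score mass, or when the
    next chunk would push the kept token count past `token_budget`.
    The kept indices are returned in original document order, which is the
    assumed meaning of positional reordering.
    """
    total = sum(scores)  # assumed > 0, with all scores non-negative
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept, mass, used = [], 0.0, 0
    for i in order:
        if used + lengths[i] > token_budget:
            break
        kept.append(i)
        used += lengths[i]
        mass += scores[i] / total
        if mass >= top_p:
            break
    return sorted(kept)


# Usage: 10 tokens, chunks of 4, placeholder scores standing in for the scorer.
chunks = chunk_tokens(list(range(10)), chunk_size=4)
keep = select_chunks([1.0, 6.0, 3.0], [len(c) for c in chunks],
                     top_p=0.8, token_budget=6)
compressed = [tok for i in keep for tok in chunks[i]]
```

Restoring document order after selection means the target model sees the surviving chunks in the sequence in which they originally appeared, rather than ranked by score.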