LongAttnComp: Cross-Family Context Compression for Long-Context Reasoning

Abstract

As real-world applications increasingly require processing inputs of more than 100k tokens, the gap between context length and inference efficiency has become a critical bottleneck. Context compression offers a way to reduce prefill cost while preserving task accuracy. However, existing training-free attention-based methods still fall well short on demanding long-context tasks such as code reasoning. We present LongAttnComp, a long-context adaptation of AttnComp that fine-tunes a lightweight cross-attention scoring layer and introduces token-level chunking, a token-budget top-p algorithm, positional reordering, and a format-agnostic query parser. We further design a two-stage fine-tuning recipe for the compressor: Stage 1 builds a general retrieval foundation from needle-in-a-haystack (NIAH) style data, and Stage 2 extends it with multi-hop and reasoning data to cover a broader range of long-context tasks. On InfiniteBench Code-Debug, LongAttnComp matches or exceeds full-context accuracy, substantially outperforms training-free baselines, and transfers across four target models from three model families. On LongBench v2, the two-stage recipe largely closes the gap that Stage 1 alone leaves on multi-document reasoning while preserving Code-Debug performance.