LongAttnComp: Long-Context Reasoning via Cross-Family Context Compression

Abstract

As real-world applications increasingly require processing inputs of 100k+ tokens, the gap between context length and inference efficiency has become a critical bottleneck. Context compression offers a way to reduce prefill costs while preserving task accuracy. However, existing training-free attention-based methods leave substantial gaps on demanding long-context tasks such as code reasoning. We present LongAttnComp, a long-context adaptation of AttnComp that fine-tunes a lightweight cross-attention scoring layer and introduces token-level chunking, a token-budget top-p algorithm, positional reordering, and a format-agnostic query parser. We further design a two-stage fine-tuning recipe for the compressor: Stage 1 builds a general retrieval foundation from NIAH-style data, and Stage 2 extends it with multi-hop and reasoning data for broader long-context task coverage. On InfiniteBench Code-Debug, LongAttnComp matches or exceeds full-context accuracy, substantially outperforms training-free baselines, and transfers across four target models from three families. On LongBench v2, the two-stage recipe largely closes the Stage 1 gap on multi-document reasoning while preserving Code-Debug performance.
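
To make the selection pipeline named above more concrete, the following is a minimal sketch of how token-level chunking, a token-budget top-p selection, and positional reordering could fit together. It is an illustration under stated assumptions, not the paper's implementation: the function names `chunk_tokens` and `select_chunks`, the fixed chunk size, and the exact way the top-p mass and the token budget interact are all hypothetical. In particular, the chunk scores are passed in as plain numbers here, whereas in LongAttnComp they would presumably come from the fine-tuned cross-attention scoring layer.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], chunk_size: int) -> List[List[str]]:
    """Split a token sequence into consecutive chunks of at most chunk_size tokens."""
    return [list(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


def select_chunks(scores: Sequence[float], lengths: Sequence[int],
                  top_p: float, token_budget: int) -> List[int]:
    """Greedily keep chunks in descending score order, skipping any chunk that
    would exceed token_budget, and stop once the kept chunks' share of the
    total score reaches top_p. Returns indices in original document order.

    This combination rule is an assumption made for illustration.
    """
    total = sum(scores)
    if total <= 0:
        return []
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    mass, used = 0.0, 0
    for i in ranked:
        if used + lengths[i] > token_budget:
            continue  # this chunk does not fit; a later, shorter one still might
        kept.append(i)
        used += lengths[i]
        mass += scores[i] / total
        if mass >= top_p:
            break
    return sorted(kept)  # positional reordering: restore source order


# Usage with made-up scores (hypothetical values, not results from the paper).
tokens = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(tokens, chunk_size=4)
scores = [0.1, 0.6, 0.3]
keep = select_chunks(scores, [len(c) for c in chunks], top_p=0.8, token_budget=6)
compressed = [tok for i in keep for tok in chunks[i]]
```

Sorting the kept indices before concatenation is what keeps the compressed context in its original reading order, even though chunks are chosen by score.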