LongAttnComp: Long-Context Inference via Cross-Family Context Compression

Abstract

As real-world applications increasingly require processing inputs of 100k+ tokens, the gap between context length and inference efficiency has become a critical bottleneck. Context compression offers a way to reduce prefill costs while preserving task accuracy. However, existing training-free attention-based methods leave substantial gaps on demanding long-context tasks such as code reasoning. We present LongAttnComp, a long-context adaptation of AttnComp that fine-tunes a lightweight cross-attention scoring layer and introduces token-level chunking, a token-budget top-p algorithm, positional reordering, and a format-agnostic query parser. We further design a two-stage fine-tuning recipe for the compressor: Stage 1 builds a general retrieval foundation from NIAH-style data, and Stage 2 extends it with multi-hop and reasoning data to cover a broader range of long-context tasks. On InfiniteBench Code-Debug, LongAttnComp matches or exceeds full-context accuracy, substantially outperforms training-free baselines, and transfers across four target models from three model families. On LongBench v2, the two-stage recipe largely closes the Stage 1 gap on multi-document reasoning while preserving Code-Debug performance.
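
To make the selection step more concrete, the following is a minimal sketch of how token-level chunking, a token-budget top-p cutoff, and positional reordering could fit together. It is an illustration under stated assumptions, not the LongAttnComp implementation: the function names `chunk_tokens` and `select_chunks`, the fixed chunk size, the per-chunk scores, and the reading of "positional reordering" as restoring original document order are all hypothetical.

```python
from typing import List, Sequence


def chunk_tokens(n_tokens: int, chunk_size: int) -> List[range]:
    """Split token positions [0, n_tokens) into fixed-size chunks (assumed scheme)."""
    return [
        range(start, min(start + chunk_size, n_tokens))
        for start in range(0, n_tokens, chunk_size)
    ]


def select_chunks(
    scores: Sequence[float],
    chunk_lens: Sequence[int],
    top_p: float,
    token_budget: int,
) -> List[int]:
    """Keep the highest-scoring chunks until either the cumulative normalized
    score mass reaches top_p or the next chunk would exceed token_budget.
    Returns the kept chunk indices in original document order."""
    total = sum(scores)
    if total <= 0:
        return []

    # Visit chunks from highest to lowest score (stand-in for attention scores).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    kept: List[int] = []
    mass = 0.0  # cumulative normalized score of kept chunks
    used = 0    # tokens consumed by kept chunks
    for i in order:
        if mass >= top_p:
            break  # top-p mass reached
        if used + chunk_lens[i] > token_budget:
            break  # token budget would be exceeded
        kept.append(i)
        mass += scores[i] / total
        used += chunk_lens[i]

    # Assumed form of positional reordering: restore document order.
    return sorted(kept)


if __name__ == "__main__":
    chunks = chunk_tokens(n_tokens=20, chunk_size=4)
    scores = [0.1, 0.5, 0.05, 0.3, 0.05]  # hypothetical per-chunk scores
    lens = [len(c) for c in chunks]
    print(select_chunks(scores, lens, top_p=0.7, token_budget=12))
```

In this sketch the two stopping conditions are independent: top-p bounds how much score mass is retained, while the token budget caps the compressed length regardless of how the scores are distributed.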