QwenLong-CPRS: Towards ∞-LLMs with Dynamic Context Optimization

Abstract

This technical report presents QwenLong-CPRS, a context compression framework designed for explicit long-context optimization. It addresses two problems in long-sequence processing with large language models (LLMs): prohibitive computation overhead during the prefill stage and the "lost in the middle" performance degradation. Implemented through a novel dynamic context optimization mechanism, QwenLong-CPRS enables multi-granularity context compression guided by natural language instructions, achieving both efficiency gains and improved performance. Evolved from the Qwen architecture series, QwenLong-CPRS introduces four key innovations: (1) natural language-guided dynamic optimization, (2) bidirectional reasoning layers for enhanced boundary awareness, (3) token critic mechanisms with language modeling heads, and (4) window-parallel inference. Comprehensive evaluations across five benchmarks (4K to 2M word contexts) demonstrate QwenLong-CPRS's threefold effectiveness: (1) consistent superiority over other context management methods such as RAG and sparse attention in both accuracy and efficiency; (2) architecture-agnostic integration with all flagship LLMs, including GPT-4o, Gemini2.0-pro, Claude3.7-sonnet, DeepSeek-v3, and Qwen2.5-max, achieving 21.59× context compression alongside 19.15-point average performance gains; and (3) when deployed with Qwen2.5-32B-Instruct, QwenLong-CPRS surpasses leading proprietary LLMs by 4.85 and 10.88 points on Ruler-128K and InfiniteBench, respectively, establishing new SOTA performance.
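The idea of instruction-guided, window-parallel compression can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_critic` is a hypothetical keyword-matching stand-in for the learned token critic, and `split_windows`, `compress`, and `compression_ratio` are names assumed here for illustration. The sketch only shows the control flow: split the context into windows, score each window independently (and therefore in parallel), and keep the tokens the critic marks as relevant to the instruction.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def split_windows(tokens: List[str], window_size: int) -> List[List[str]]:
    """Cut the token sequence into consecutive, non-overlapping windows."""
    return [tokens[i:i + window_size] for i in range(0, len(tokens), window_size)]


def toy_critic(instruction: str) -> Callable[[List[str]], List[bool]]:
    """Hypothetical stand-in for a learned token critic.

    It marks a token as 'keep' when the token appears in the instruction.
    A real critic would be a model that labels each token.
    """
    keywords = {word.lower() for word in instruction.split()}

    def score(window: List[str]) -> List[bool]:
        return [tok.lower() in keywords for tok in window]

    return score


def compress(tokens: List[str], instruction: str,
             window_size: int = 4, workers: int = 4) -> List[str]:
    """Score every window independently, then keep only the marked tokens."""
    critic = toy_critic(instruction)
    windows = split_windows(tokens, window_size)
    # Windows do not depend on each other, so they can be scored in parallel.
    # pool.map returns results in input order, so token order is preserved.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        masks = list(pool.map(critic, windows))
    return [tok
            for window, mask in zip(windows, masks)
            for tok, keep in zip(window, mask)
            if keep]


def compression_ratio(original_len: int, compressed_len: int) -> float:
    """Ratio of original to compressed length (e.g. 6.0 means 6x shorter)."""
    if compressed_len == 0:
        raise ValueError("compressed context is empty")
    return original_len / compressed_len


context = "the launch date is March 3 and the budget is 40 million".split()
kept = compress(context, "launch date")
ratio = compression_ratio(len(context), len(kept))
```

Changing the instruction changes which tokens survive, which is the sense in which the compression is "guided by natural language instructions"; the compressed context would then be passed to any downstream LLM in place of the full input.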