QwenLong-CPRS: Towards ∞-LLMs with Dynamic Context Optimization

Abstract

This technical report presents QwenLong-CPRS, a context compression framework designed for explicit long-context optimization. It addresses two problems that large language models (LLMs) face when processing long sequences: prohibitive computation overhead during the prefill stage and "lost in the middle" performance degradation. Implemented through a novel dynamic context optimization mechanism, QwenLong-CPRS enables multi-granularity context compression guided by natural language instructions, achieving both efficiency gains and improved performance. Evolved from the Qwen architecture series, QwenLong-CPRS introduces four key innovations: (1) natural language-guided dynamic optimization, (2) bidirectional reasoning layers for enhanced boundary awareness, (3) token critic mechanisms with language modeling heads, and (4) window-parallel inference. Comprehensive evaluations across five benchmarks (4K to 2M word contexts) demonstrate threefold effectiveness: (1) consistent superiority over other context management methods such as RAG and sparse attention in both accuracy and efficiency; (2) architecture-agnostic integration with all flagship LLMs, including GPT-4o, Gemini2.0-pro, Claude3.7-sonnet, DeepSeek-v3, and Qwen2.5-max, achieving 21.59× context compression alongside 19.15-point average performance gains; and (3) when deployed with Qwen2.5-32B-Instruct, QwenLong-CPRS surpasses leading proprietary LLMs by 4.85 and 10.88 points on Ruler-128K and InfiniteBench, respectively, establishing new SOTA performance.
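The abstract names instruction-guided compression and window-parallel inference but gives no implementation details. The following is only a toy sketch of that general shape, not the QwenLong-CPRS method: the context is split into windows, each window is filtered independently and in parallel, and the kept pieces are concatenated in their original order. Every name here (`split_windows`, `toy_critic`, `compress`) is hypothetical, and the keyword-overlap rule is an assumed stand-in for the learned token critic described in the report.

```python
from concurrent.futures import ThreadPoolExecutor


def split_windows(sentences, window_size):
    """Split a list of sentences into consecutive, non-overlapping windows."""
    return [sentences[i:i + window_size] for i in range(0, len(sentences), window_size)]


def toy_critic(window, query_terms):
    """Hypothetical stand-in for a learned critic.

    Keeps a sentence only if it shares at least one lowercase word with the query.
    """
    return [s for s in window if query_terms & set(s.lower().split())]


def compress(sentences, query, window_size=2):
    """Filter each window independently (in parallel) and rejoin in original order."""
    query_terms = set(query.lower().split())
    windows = split_windows(sentences, window_size)
    with ThreadPoolExecutor() as pool:
        # pool.map returns results in the same order as the input windows.
        kept_per_window = list(pool.map(lambda w: toy_critic(w, query_terms), windows))
    return [s for kept in kept_per_window for s in kept]


context = [
    "the launch code is 7421",
    "rain fell on the harbor all night",
    "ships waited outside the breakwater",
    "the code was changed in march",
    "gulls circled over the pier",
    "nobody noticed the tide",
]
compressed = compress(context, "launch code")

# Compression ratio measured in words: original length over compressed length.
original_words = sum(len(s.split()) for s in context)
compressed_words = sum(len(s.split()) for s in compressed)
ratio = original_words / compressed_words
print(compressed, round(ratio, 2))
```

In this sketch the compressed context, rather than the full one, would be passed to the downstream LLM together with the query. Because each window is scored independently, the work parallelizes across windows, which is the property the window-parallel design relies on.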
