QwenLong-CPRS: Towards ∞-LLMs with Dynamic Context Optimization

Abstract

This technical report presents QwenLong-CPRS, a context compression framework designed for explicit long-context optimization. It addresses two problems that large language models (LLMs) face when processing long sequences: prohibitive computation overhead during the prefill stage and the "lost in the middle" performance degradation. Implemented through a novel dynamic context optimization mechanism, QwenLong-CPRS enables multi-granularity context compression guided by natural language instructions, achieving both efficiency gains and improved performance. Evolved from the Qwen architecture series, QwenLong-CPRS introduces four key innovations: (1) natural language-guided dynamic optimization, (2) bidirectional reasoning layers for enhanced boundary awareness, (3) token critic mechanisms with language modeling heads, and (4) window-parallel inference. Comprehensive evaluations across five benchmarks (contexts of 4K to 2M words) demonstrate QwenLong-CPRS's threefold effectiveness: (1) consistent superiority over other context management methods such as RAG and sparse attention in both accuracy and efficiency; (2) architecture-agnostic integration with all flagship LLMs, including GPT-4o, Gemini2.0-pro, Claude3.7-sonnet, DeepSeek-v3, and Qwen2.5-max, achieving 21.59× context compression alongside 19.15-point average performance gains; and (3) when deployed with Qwen2.5-32B-Instruct, QwenLong-CPRS surpasses leading proprietary LLMs by 4.85 and 10.88 points on Ruler-128K and InfiniteBench, establishing new state-of-the-art (SOTA) performance.
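The abstract names instruction-guided compression and window-parallel inference but gives no implementation details, so the following is only a toy sketch of the general idea, not the QwenLong-CPRS method. It assumes the context is split into fixed-size windows of sentences, each window is filtered independently (and so can run in parallel), and the kept pieces are concatenated in their original order. All function names are hypothetical, and a simple word-overlap score stands in for the learned token critic.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def split_windows(sentences, window_size):
    """Group sentences into fixed-size, non-overlapping windows."""
    return [sentences[i:i + window_size]
            for i in range(0, len(sentences), window_size)]


def toy_critic(instruction, sentence):
    """Hypothetical stand-in for a learned critic: count words shared with the instruction."""
    query = set(re.findall(r"\w+", instruction.lower()))
    words = set(re.findall(r"\w+", sentence.lower()))
    return len(query & words)


def compress_window(instruction, window, min_score):
    """Keep only the sentences in one window that score at least min_score."""
    return [s for s in window if toy_critic(instruction, s) >= min_score]


def compress_context(instruction, sentences, window_size=2, min_score=1):
    """Filter every window independently, then concatenate results in original order."""
    windows = split_windows(sentences, window_size)
    with ThreadPoolExecutor() as pool:
        # pool.map returns results in the same order as the input windows.
        kept = list(pool.map(
            lambda w: compress_window(instruction, w, min_score), windows))
    return [s for window in kept for s in window]


context = [
    "The meeting was moved to Tuesday.",
    "Lunch will be served at noon.",
    "The budget for the project is 40,000 dollars.",
    "Parking is available behind the building.",
]
compressed = compress_context("project budget", context)
# Compression ratio measured in sentences (the report measures context length instead).
ratio = len(context) / max(len(compressed), 1)
```

Because each window is scored without reference to the others, the per-window work can be distributed freely; the trade-off in this toy version is that evidence spanning a window boundary is judged in pieces.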
