
Qwen2.5-1M Technical Report

January 26, 2025
Authors: An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, Junyang Lin, Kai Dang, Kexin Yang, Le Yu, Mei Li, Minmin Sun, Qin Zhu, Rui Men, Tao He, Weijia Xu, Wenbiao Yin, Wenyuan Yu, Xiafei Qiu, Xingzhang Ren, Xinlong Yang, Yong Li, Zhiying Xu, Zipeng Zhang
cs.AI

Abstract

We introduce Qwen2.5-1M, a series of models that extend the context length to 1 million tokens. Compared to the previous 128K version, the Qwen2.5-1M series has significantly enhanced long-context capabilities through long-context pre-training and post-training. Key techniques such as long data synthesis, progressive pre-training, and multi-stage supervised fine-tuning are employed to effectively enhance long-context performance while reducing training costs.

To promote the use of long-context models among a broader user base, we present and open-source our inference framework. This framework includes a length extrapolation method that can expand the model context length by at least four times, or even more, without additional training. To reduce inference costs, we implement a sparse attention method, along with chunked prefill optimization for deployment scenarios and a sparsity refinement method to improve precision. Additionally, we detail our optimizations in the inference engine, including kernel optimization, pipeline parallelism, and scheduling optimization, which significantly enhance overall inference performance. By leveraging our inference framework, the Qwen2.5-1M models achieve a remarkable 3x to 7x prefill speedup in scenarios with 1 million tokens of context. This framework provides an efficient and powerful solution for developing applications that require long-context processing with open-source models.

The Qwen2.5-1M series currently includes the open-source models Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, as well as the API-accessed model Qwen2.5-Turbo. Evaluations show that the Qwen2.5-1M models are greatly improved on long-context tasks without compromising performance in short-context scenarios. Specifically, the Qwen2.5-14B-Instruct-1M model significantly outperforms GPT-4o-mini on long-context tasks while supporting contexts eight times longer.
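The abstract mentions a training-free length extrapolation method without naming its mechanism. Purely as an illustrative sketch, here is one well-known training-free approach to the same goal, NTK-aware RoPE base rescaling, which enlarges the rotary base so a 4x longer context maps back into the rotation range the model saw during training. All functions and numbers below are hypothetical, not the paper's implementation:

```python
def ntk_scaled_rope_base(base: float, head_dim: int, scale: float) -> float:
    """NTK-aware scaling: enlarge the RoPE base so positions beyond the
    trained context reuse the rotation range seen during training."""
    return base * scale ** (head_dim / (head_dim - 2))

def rope_angles(pos: int, head_dim: int, base: float) -> list[float]:
    """RoPE rotation angle of each frequency pair at a given position."""
    return [pos / base ** (2 * i / head_dim) for i in range(head_dim // 2)]

# Hypothetical setup: a model trained on a 128K context, extended 4x to 512K.
trained_ctx, target_ctx = 128_000, 512_000
head_dim, base = 128, 10_000.0
scaled_base = ntk_scaled_rope_base(base, head_dim, target_ctx / trained_ctx)

# With the scaled base, the slowest-rotating dimension at the extended
# position lands on the same angle it reached at the original context limit.
orig_max_angle = rope_angles(trained_ctx, head_dim, base)[-1]
ext_angle = rope_angles(target_ctx, head_dim, scaled_base)[-1]
```

The design intuition: high-frequency dimensions keep their resolution for nearby tokens, while the lowest-frequency dimension is stretched just enough that extended positions never rotate past what the model was trained on.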
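The chunked prefill optimization mentioned above bounds peak working memory while processing a long prompt. A minimal sketch of the general idea (the function and sizes are hypothetical, not the paper's engine): queries are fed in fixed-size chunks, each attending causally to the keys cached so far, so no step ever materializes the full prompt-by-prompt attention computation.

```python
def chunked_prefill_stats(num_tokens: int, chunk_size: int):
    """Simulate prefilling a long prompt chunk by chunk.

    Each chunk's queries attend causally to all previously cached keys plus
    the chunk itself, so the largest attention matrix materialized at any
    step is at most chunk_size x num_tokens, rather than the
    num_tokens x num_tokens of a single-shot prefill.
    Returns (tokens cached, peak queries-by-keys matrix size).
    """
    cached = 0          # keys/values already written to the KV cache
    peak_matrix = 0     # largest (queries x keys) product built in one step
    while cached < num_tokens:
        q = min(chunk_size, num_tokens - cached)  # queries in this chunk
        keys = cached + q                         # cache + current chunk
        peak_matrix = max(peak_matrix, q * keys)
        cached += q
    return cached, peak_matrix

# Hypothetical sizes: a 1M-token prompt prefilled in 32K-token chunks.
cached, peak = chunked_prefill_stats(1_000_000, 32_768)
```

The trade-off: chunking adds a bounded number of passes over the cache but shrinks the peak per-step working set by roughly chunk_size / num_tokens, which is what makes million-token prompts practical to serve.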
