Qwen2.5-1M Technical Report

Abstract

We introduce Qwen2.5-1M, a series of models that extends the context length to 1 million tokens. Compared to the previous 128K version, the Qwen2.5-1M series has significantly stronger long-context capabilities, obtained through long-context pre-training and post-training. Key techniques such as long data synthesis, progressive pre-training, and multi-stage supervised fine-tuning improve long-context performance while reducing training costs. To promote the use of long-context models among a broader user base, we present and open-source our inference framework. The framework includes a length extrapolation method that can extend the model's context length by at least four times, or even more, without additional training. To reduce inference costs, we implement a sparse attention method together with chunked prefill optimization for deployment scenarios, and a sparsity refinement method to improve precision. We also detail our optimizations in the inference engine, including kernel optimization, pipeline parallelism, and scheduling optimization, which significantly improve overall inference performance. With this inference framework, the Qwen2.5-1M models achieve a 3x to 7x prefill speedup in scenarios with 1 million tokens of context. The framework provides an efficient and powerful solution for developing applications that require long-context processing with open-source models. The Qwen2.5-1M series currently includes the open-source models Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, as well as the API-accessed model Qwen2.5-Turbo. Evaluations show that the Qwen2.5-1M models are greatly improved on long-context tasks without compromising performance in short-context scenarios. Specifically, the Qwen2.5-14B-Instruct-1M model significantly outperforms GPT-4o-mini on long-context tasks and supports contexts eight times longer.
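
To make the idea of chunked prefill concrete, the following is a minimal, toy sketch in pure Python. It is not the Qwen2.5-1M inference framework: it assumes a single attention head, dense (not sparse) attention, and hypothetical helper names (`attend`, `full_prefill`, `chunked_prefill`). It only illustrates that processing a prompt chunk by chunk against a growing key/value cache gives the same causal-attention outputs as a single pass, while limiting how many query tokens are handled at once.

```python
import math
import random

def attend(q, keys, values):
    """Softmax attention of one query vector over the given keys/values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def full_prefill(qs, ks, vs):
    """Causal attention over the whole prompt in one pass."""
    return [attend(qs[i], ks[: i + 1], vs[: i + 1]) for i in range(len(qs))]

def chunked_prefill(qs, ks, vs, chunk_size):
    """Causal attention computed chunk by chunk with a growing KV cache."""
    k_cache, v_cache, outputs = [], [], []
    for start in range(0, len(qs), chunk_size):
        end = min(start + chunk_size, len(qs))
        # Append this chunk's keys/values to the cache before attending.
        k_cache.extend(ks[start:end])
        v_cache.extend(vs[start:end])
        for i in range(start, end):
            # Token i attends to cached tokens 0..i only (causal mask).
            outputs.append(attend(qs[i], k_cache[: i + 1], v_cache[: i + 1]))
    return outputs

def random_vectors(n, dim, rng):
    """Hypothetical stand-in for projected query/key/value vectors."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]

rng = random.Random(0)
qs, ks, vs = (random_vectors(10, 4, rng) for _ in range(3))
full = full_prefill(qs, ks, vs)
chunked = chunked_prefill(qs, ks, vs, chunk_size=3)
max_diff = max(abs(a - b) for x, y in zip(full, chunked) for a, b in zip(x, y))
```

In this sketch `max_diff` stays at floating-point noise level for any `chunk_size`, which is the property that lets an inference engine trade chunk size against peak activation memory without changing the result.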
