Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation

Abstract

Recent advances in language modeling have demonstrated the effectiveness of State Space Models (SSMs) for efficient sequence modeling. Hybrid architectures such as Samba and the decoder-decoder architecture YOCO have shown promising performance gains over Transformers, but prior work has not investigated the efficiency potential of sharing representations between SSM layers. In this paper, we introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers. We apply it to create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs in the cross-decoder to share memory readout states from a Samba-based self-decoder. SambaY significantly improves decoding efficiency, preserves linear pre-filling time complexity, and boosts long-context performance, all while eliminating the need for explicit positional encoding. Through extensive scaling experiments, we demonstrate that our model exhibits a significantly lower irreducible loss than a strong YOCO baseline, indicating superior performance scalability under large-scale compute regimes. Our largest model enhanced with Differential Attention, Phi4-mini-Flash-Reasoning, achieves significantly better performance than Phi4-mini-Reasoning on reasoning tasks such as Math500, AIME24/25, and GPQA Diamond without any reinforcement learning, while delivering up to 10x higher decoding throughput on 2K-length prompts with 32K generation length under the vLLM inference framework. We release our training codebase on open-source data at https://github.com/microsoft/ArchScale.
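
The abstract names the Gated Memory Unit but does not give its formula. As a rough illustration of what gated memory sharing across layers can look like, the sketch below assumes an element-wise gate: the current layer's hidden state is projected, passed through a SiLU activation, and multiplied into a memory vector produced by an earlier layer, and the result is projected back to the model width. The function names, shapes, activation choice, and weight layout are all assumptions made for this example and are not taken from the paper or from the ArchScale codebase.

```python
import math
import random


def silu(v: float) -> float:
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(w: list[list[float]], x: list[float]) -> list[float]:
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def gated_memory_unit(x, memory, w_gate, w_out):
    """Hypothetical gated memory unit for a single token (assumed form).

    x:      hidden state of the current layer, length d_model
    memory: state shared from an earlier layer, length d_mem
    w_gate: d_mem x d_model projection that produces the gate
    w_out:  d_model x d_mem projection back to the model width
    """
    gate = [silu(g) for g in matvec(w_gate, x)]    # gate depends on the current input
    gated = [m * g for m, g in zip(memory, gate)]  # element-wise modulation of shared memory
    return matvec(w_out, gated)


def random_matrix(rows: int, cols: int, rng: random.Random) -> list[list[float]]:
    """Small random weights, for demonstration only."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_mem = 4, 6  # illustrative sizes, not the model's real dimensions
w_gate = random_matrix(d_mem, d_model, rng)
w_out = random_matrix(d_model, d_mem, rng)
x = [rng.uniform(-1.0, 1.0) for _ in range(d_model)]
memory = [rng.uniform(-1.0, 1.0) for _ in range(d_mem)]

y = gated_memory_unit(x, memory, w_gate, w_out)
print(len(y))  # output has the model width
```

In this sketch the memory vector is computed once elsewhere and only reused here, so the unit itself costs two projections and one element-wise product per token. That is the kind of saving the abstract's claim about decoding efficiency points to, though the actual mechanism may differ in its details.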
