Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation
July 9, 2025
Authors: Liliang Ren, Congcong Chen, Haoran Xu, Young Jin Kim, Adam Atkinson, Zheng Zhan, Jiankai Sun, Baolin Peng, Liyuan Liu, Shuohang Wang, Hao Cheng, Jianfeng Gao, Weizhu Chen, Yelong Shen
cs.AI
Abstract
Recent advances in language modeling have demonstrated the effectiveness of
State Space Models (SSMs) for efficient sequence modeling. While hybrid
architectures such as Samba and the decoder-decoder architecture, YOCO, have
shown promising performance gains over Transformers, prior work has not
investigated the efficiency potential of representation sharing between SSM
layers. In this paper, we introduce the Gated Memory Unit (GMU), a simple yet
effective mechanism for efficient memory sharing across layers. We apply it to
create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs in
the cross-decoder to share memory readout states from a Samba-based
self-decoder. SambaY significantly enhances decoding efficiency, preserves
linear pre-filling time complexity, and boosts long-context performance, all
while eliminating the need for explicit positional encoding. Through extensive
scaling experiments, we demonstrate that our model exhibits a significantly
lower irreducible loss compared to a strong YOCO baseline, indicating superior
performance scalability under large-scale compute regimes. Our largest model
enhanced with Differential Attention, Phi4-mini-Flash-Reasoning, achieves
significantly better performance than Phi4-mini-Reasoning on reasoning tasks
such as Math500, AIME24/25, and GPQA Diamond without any reinforcement
learning, while delivering up to 10x higher decoding throughput on 2K-length
prompts with 32K generation length under the vLLM inference framework. We
release our training codebase on open-source data at
https://github.com/microsoft/ArchScale.
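
The abstract describes the Gated Memory Unit (GMU) as a mechanism that lets cross-decoder layers reuse memory readout states produced by the Samba-based self-decoder instead of recomputing attention over the full context. The sketch below is only an illustration of that idea under stated assumptions: the `GatedMemoryUnit` class name, the SwiGLU-style gating form, and the use of PyTorch are my own choices for clarity, not the paper's exact formulation (see the ArchScale repository for the reference implementation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedMemoryUnit(nn.Module):
    """Illustrative sketch of a gated memory-sharing layer.

    It gates a memory readout state `m` (e.g., shared from an SSM layer in the
    self-decoder) with a projection of the current hidden state `x`, so that a
    cross-decoder layer can reuse the memory cheaply at decode time.
    The SiLU-gated (GLU-style) form is an assumption made for illustration.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        # x: current layer input; m: memory readout shared from an earlier layer.
        gate = F.silu(self.gate_proj(x))          # element-wise gate from x
        return self.out_proj(gate * m)            # gated memory, then projection


# Minimal usage example with random tensors (batch=2, seq=4, d_model=8).
if __name__ == "__main__":
    gmu = GatedMemoryUnit(d_model=8)
    x = torch.randn(2, 4, 8)
    m = torch.randn(2, 4, 8)
    y = gmu(x, m)
    print(y.shape)  # torch.Size([2, 4, 8])
```

Because the gate depends only on the current token's hidden state and the shared memory, a layer built this way adds no extra key-value cache of its own, which is consistent with the abstract's claim of improved decoding efficiency.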