Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation
July 9, 2025
Authors: Liliang Ren, Congcong Chen, Haoran Xu, Young Jin Kim, Adam Atkinson, Zheng Zhan, Jiankai Sun, Baolin Peng, Liyuan Liu, Shuohang Wang, Hao Cheng, Jianfeng Gao, Weizhu Chen, Yelong Shen
cs.AI
Abstract
Recent advances in language modeling have demonstrated the effectiveness of State Space Models (SSMs) for efficient sequence modeling. While hybrid architectures such as Samba and the decoder-decoder architecture, YOCO, have shown promising performance gains over Transformers, prior works have not investigated the efficiency potential of representation sharing between SSM layers. In this paper, we introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers. We apply it to create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs in the cross-decoder to share memory readout states from a Samba-based self-decoder. SambaY significantly enhances decoding efficiency, preserves linear pre-filling time complexity, and boosts long-context performance, all while eliminating the need for explicit positional encoding. Through extensive scaling experiments, we demonstrate that our model exhibits a significantly lower irreducible loss compared to a strong YOCO baseline, indicating superior performance scalability under large-scale compute regimes. Our largest model enhanced with Differential Attention, Phi4-mini-Flash-Reasoning, achieves significantly better performance than Phi4-mini-Reasoning on reasoning tasks such as Math500, AIME24/25, and GPQA Diamond without any reinforcement learning, while delivering up to 10x higher decoding throughput on 2K-length prompts with 32K generation length under the vLLM inference framework. We release our training codebase on open-source data at https://github.com/microsoft/ArchScale.
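The abstract describes the GMU only at a high level, so the following is a minimal sketch of what a layer-to-layer memory-sharing gate of this kind could look like in PyTorch. The class name `GatedMemoryUnit`, the GLU-style elementwise gating, the SiLU nonlinearity, and the projection shapes are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a GMU-style gating block (not the paper's exact design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMemoryUnit(nn.Module):
    """Gates a memory readout shared from an earlier SSM layer with the
    current layer's hidden state, then projects back to model width."""

    def __init__(self, d_model: int, d_memory: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_memory, bias=False)  # gate computed from the current hidden state
        self.out_proj = nn.Linear(d_memory, d_model, bias=False)   # map gated memory back to model width

    def forward(self, x: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) input to the current cross-decoder layer
        # m: (batch, seq, d_memory) memory readout reused from the self-decoder
        gate = F.silu(self.gate_proj(x))   # assumption: SiLU-activated gate
        return self.out_proj(gate * m)     # elementwise gating of the shared memory
```

If a block like this replaces some of the attention layers in the cross-decoder, each reuse of the self-decoder's memory readout costs only two linear projections and an elementwise product per token, which is consistent with the decoding-efficiency gains the abstract reports; the exact formulation and placement of GMUs within SambaY's cross-decoder should be taken from the paper and the released codebase.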