Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
October 6, 2025
Authors: Sangmin Bae, Bilge Acun, Haroun Habeeb, Seungyeon Kim, Chien-Yu Lin, Liang Luo, Junjie Wang, Carole-Jean Wu
cs.AI
Abstract
Recent progress in large language models demonstrates that hybrid architectures, which combine self-attention mechanisms with structured state space models such as Mamba, can achieve a compelling balance between modeling quality and computational efficiency, particularly for long-context tasks. While these hybrid models show promising performance, systematic comparisons of hybridization strategies and analyses of the key factors behind their effectiveness have not been clearly shared with the community. In this work, we present a holistic evaluation of hybrid architectures based on inter-layer (sequential) or intra-layer (parallel) fusion. We evaluate these designs from a variety of perspectives: language modeling performance, long-context capabilities, scaling analysis, and training and inference efficiency. By investigating the core characteristics of their computational primitives, we identify the most critical elements of each hybridization strategy and further propose optimal design recipes for both hybrid models. Our comprehensive analysis provides practical guidance and valuable insights for developing hybrid language models, facilitating the optimization of architectural configurations.
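
To make the two fusion strategies concrete, the sketch below contrasts an inter-layer (sequential) hybrid, which alternates attention and state-space layers, with an intra-layer (parallel) hybrid, which runs both branches inside one layer and mixes their outputs. This is a minimal illustrative sketch, not the paper's implementation: SSMBlock is a hypothetical stand-in for a structured state space layer such as Mamba, and the layer count, fusion projection, and residual wiring are assumptions for illustration only.

```python
# Minimal sketch of inter-layer vs. intra-layer hybridization (illustrative only).
import torch
import torch.nn as nn

class SSMBlock(nn.Module):
    """Hypothetical stand-in for a structured state space layer (e.g., Mamba)."""
    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)  # placeholder computation

    def forward(self, x):
        return self.proj(x)

class AttentionBlock(nn.Module):
    """Standard self-attention block."""
    def __init__(self, d_model, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return out

class InterLayerHybrid(nn.Module):
    """Inter-layer (sequential) fusion: alternate attention and SSM layers in depth."""
    def __init__(self, d_model, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [AttentionBlock(d_model) if i % 2 == 0 else SSMBlock(d_model)
             for i in range(n_layers)]
        )

    def forward(self, x):
        for layer in self.layers:
            x = x + layer(x)  # residual connection around each block
        return x

class IntraLayerHybrid(nn.Module):
    """Intra-layer (parallel) fusion: attention and SSM branches within one layer."""
    def __init__(self, d_model):
        super().__init__()
        self.attn = AttentionBlock(d_model)
        self.ssm = SSMBlock(d_model)
        self.mix = nn.Linear(2 * d_model, d_model)  # assumed fusion of the two branches

    def forward(self, x):
        fused = self.mix(torch.cat([self.attn(x), self.ssm(x)], dim=-1))
        return x + fused

# Usage example with dummy input of shape (batch, sequence, d_model).
x = torch.randn(2, 16, 64)
print(InterLayerHybrid(64)(x).shape)  # torch.Size([2, 16, 64])
print(IntraLayerHybrid(64)(x).shape)  # torch.Size([2, 16, 64])
```

The key design difference the paper studies is visible here: sequential fusion composes the two primitives across depth, while parallel fusion combines them within a layer and must choose how to merge the branch outputs (a concatenation-plus-projection is just one assumed option).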