

Hybrid Architectures for Language Models: Systematic Analysis and Design Insights

October 6, 2025
作者: Sangmin Bae, Bilge Acun, Haroun Habeeb, Seungyeon Kim, Chien-Yu Lin, Liang Luo, Junjie Wang, Carole-Jean Wu
cs.AI

Abstract

Recent progress in large language models demonstrates that hybrid architectures--combining self-attention mechanisms with structured state space models like Mamba--can achieve a compelling balance between modeling quality and computational efficiency, particularly for long-context tasks. While these hybrid models show promising performance, systematic comparisons of hybridization strategies and analyses of the key factors behind their effectiveness have not been clearly shared with the community. In this work, we present a holistic evaluation of hybrid architectures based on inter-layer (sequential) or intra-layer (parallel) fusion. We evaluate these designs from a variety of perspectives: language modeling performance, long-context capabilities, scaling analysis, and training and inference efficiency. By investigating the core characteristics of their computational primitives, we identify the most critical elements for each hybridization strategy and further propose optimal design recipes for both hybrid models. Our comprehensive analysis provides practical guidance and valuable insights for developing hybrid language models, facilitating the optimization of architectural configurations.
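As a minimal illustration of the two fusion patterns the abstract contrasts, the sketch below composes a standard self-attention sublayer with a state-space-style mixer either across layers (inter-layer, sequential) or within each layer (intra-layer, parallel). The module names, the simple gated-convolution stand-in for Mamba, and the one-attention-block-every-four-layers ratio are illustrative assumptions, not the design recipes proposed in the paper.

```python
import torch
import torch.nn as nn


class AttentionBlock(nn.Module):
    """Pre-norm self-attention sublayer with a residual connection."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out


class SSMBlock(nn.Module):
    """Placeholder for a structured state space block (e.g., Mamba).
    A gated causal depthwise convolution keeps the sketch self-contained."""
    def __init__(self, d_model: int, kernel_size: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size - 1, groups=d_model)
        self.gate = nn.Linear(d_model, d_model)

    def forward(self, x):
        h = self.norm(x)
        # Causal depthwise convolution over the sequence dimension.
        c = self.conv(h.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return x + c * torch.sigmoid(self.gate(h))


class InterLayerHybrid(nn.Module):
    """Inter-layer (sequential) fusion: stack SSM layers, interleaving an
    attention layer at a fixed interval along the depth dimension."""
    def __init__(self, d_model: int, n_layers: int, attn_every: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            AttentionBlock(d_model) if (i + 1) % attn_every == 0 else SSMBlock(d_model)
            for i in range(n_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


class IntraLayerHybrid(nn.Module):
    """Intra-layer (parallel) fusion: each layer runs an attention branch and
    an SSM branch side by side and mixes their outputs with a linear projection."""
    def __init__(self, d_model: int, n_layers: int):
        super().__init__()
        self.attn = nn.ModuleList(AttentionBlock(d_model) for _ in range(n_layers))
        self.ssm = nn.ModuleList(SSMBlock(d_model) for _ in range(n_layers))
        self.mix = nn.ModuleList(nn.Linear(2 * d_model, d_model) for _ in range(n_layers))

    def forward(self, x):
        for attn, ssm, mix in zip(self.attn, self.ssm, self.mix):
            x = x + mix(torch.cat([attn(x), ssm(x)], dim=-1))
        return x


if __name__ == "__main__":
    x = torch.randn(2, 16, 64)  # (batch, sequence, d_model)
    print(InterLayerHybrid(64, n_layers=8)(x).shape)  # torch.Size([2, 16, 64])
    print(IntraLayerHybrid(64, n_layers=4)(x).shape)  # torch.Size([2, 16, 64])
```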