Hymba: A Hybrid-head Architecture for Small Language Models

Abstract

We propose Hymba, a family of small language models featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with state space models (SSMs) for enhanced efficiency. Attention heads provide high-resolution recall, while SSM heads enable efficient context summarization. Additionally, we introduce learnable meta tokens that are prepended to prompts, storing critical information and alleviating the "forced-to-attend" burden associated with attention mechanisms. The model is further optimized by incorporating cross-layer key-value (KV) sharing and partial sliding window attention, resulting in a compact cache size. During development, we conducted a controlled study comparing various architectures under identical settings and observed significant advantages of our proposed architecture. Notably, Hymba achieves state-of-the-art results for small LMs: our Hymba-1.5B-Base model surpasses all sub-2B public models in performance and even outperforms Llama-3.2-3B, with 1.32% higher average accuracy, an 11.67x reduction in cache size, and 3.49x higher throughput.
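To make the hybrid-head idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It runs one causal attention head and one toy SSM head in parallel on the same input, with a few meta tokens prepended. Several details are assumptions made only for illustration: the SSM is reduced to a diagonal linear recurrence, the two head outputs are fused by averaging after an RMS normalization, and all names (`attention_head`, `ssm_head`, `hybrid_layer`, `params`) are hypothetical. KV sharing and sliding window attention are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(x, wq, wk, wv):
    """Causal softmax attention over the full sequence (high-resolution recall)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    t = x.shape[0]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    i, j = np.indices((t, t))
    scores = np.where(j <= i, scores, -np.inf)  # position i sees only j <= i
    return softmax(scores) @ v

def ssm_head(x, a, b, c):
    """Toy stand-in for an SSM: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t.
    The fixed-size state h summarizes the whole prefix (assumption: diagonal, time-invariant)."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        out[t] = c * h
    return out

def rms_norm(y, eps=1e-6):
    return y / np.sqrt((y ** 2).mean(axis=-1, keepdims=True) + eps)

def hybrid_layer(tokens, meta, params):
    """Prepend meta tokens, run both heads in parallel on the same input, fuse the outputs."""
    x = np.concatenate([meta, tokens], axis=0)
    attn = attention_head(x, params["wq"], params["wk"], params["wv"])
    ssm = ssm_head(x, params["a"], params["b"], params["c"])
    fused = 0.5 * (rms_norm(attn) + rms_norm(ssm))  # assumed fusion rule
    return fused[meta.shape[0]:]  # drop the meta-token positions

rng = np.random.default_rng(0)
d, n_meta, n_tok = 8, 2, 5
params = {
    "wq": rng.normal(size=(d, d)),
    "wk": rng.normal(size=(d, d)),
    "wv": rng.normal(size=(d, d)),
    "a": np.full(d, 0.9),
    "b": np.ones(d),
    "c": np.ones(d),
}
meta = rng.normal(size=(n_meta, d))    # would be learned in a real model
tokens = rng.normal(size=(n_tok, d))
out = hybrid_layer(tokens, meta, params)
print(out.shape)
```

In this sketch the attention head needs keys and values for every earlier position, while the SSM head carries only a fixed-size state, which is the efficiency trade-off the abstract describes.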
