Differential Mamba

Abstract

Sequence models such as Transformers and RNNs often overallocate attention to irrelevant context, which produces noisy intermediate representations. This degrades LLM capabilities by promoting hallucinations, weakening long-range and retrieval abilities, and reducing robustness. Recent work has shown that differential design can mitigate this issue in Transformers, improving their effectiveness across various applications. In this paper, we explore whether these techniques, originally developed for Transformers, can be applied to Mamba, a recent architecture based on selective state-space layers that achieves Transformer-level performance with greater efficiency. We show that naively adapting differential design to Mamba is insufficient and that careful architectural modifications are required. To address this, we introduce a novel differential mechanism for Mamba and validate it empirically on language modeling benchmarks, where it demonstrates improved retrieval capabilities and superior performance over vanilla Mamba. Finally, we conduct extensive ablation studies and empirical analyses to justify our design choices and to provide evidence that our approach effectively mitigates the overallocation problem in Mamba-based models. Our code is publicly available.
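The abstract does not spell out what "differential design" looks like in practice. As a rough illustration of the idea in the Transformer setting only, the sketch below subtracts a second softmax attention map, scaled by a factor, from a first one, so that attention mass both maps assign to the same irrelevant tokens tends to cancel. The function names, the fixed scalar `lam` (learnable in practice), and the toy inputs are assumptions made for illustration; this is not the Mamba mechanism proposed in the paper, which the abstract notes requires further architectural changes.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Scaled dot-product attention weights, one row per query."""
    d = len(keys[0])
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys])
        for q in queries
    ]


def differential_weights(q1, k1, q2, k2, lam=0.5):
    """Difference of two attention maps: A1 - lam * A2 (hypothetical helper)."""
    a1 = attention_map(q1, k1)
    a2 = attention_map(q2, k2)
    return [[x - lam * y for x, y in zip(r1, r2)] for r1, r2 in zip(a1, a2)]


def differential_attention(q1, k1, q2, k2, values, lam=0.5):
    """Apply the differential weights to the value vectors."""
    weights = differential_weights(q1, k1, q2, k2, lam)
    dim = len(values[0])
    return [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(dim)]
        for row in weights
    ]


# Toy inputs: 2 queries, 3 keys/values, head dimension 2.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = differential_attention(q, k, q, k, v, lam=0.5)
```

Because each softmax row sums to 1, each row of the differential weights sums to `1 - lam`, and individual weights may be negative; when both maps are identical and `lam = 1`, the output vanishes entirely, which is the cancellation effect in its extreme form.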
