
Differential Mamba

July 8, 2025
Authors: Nadav Schneider, Itamar Zimerman, Eliya Nachmani
cs.AI

Abstract

Sequence models like Transformers and RNNs often overallocate attention to irrelevant context, leading to noisy intermediate representations. This degrades LLM capabilities by promoting hallucinations, weakening long-range and retrieval abilities, and reducing robustness. Recent work has shown that differential design can mitigate this issue in Transformers, improving their effectiveness across various applications. In this paper, we explore whether these techniques, originally developed for Transformers, can be applied to Mamba, a recent architecture based on selective state-space layers that achieves Transformer-level performance with greater efficiency. We show that a naive adaptation of differential design to Mamba is insufficient and requires careful architectural modifications. To address this, we introduce a novel differential mechanism for Mamba, empirically validated on language modeling benchmarks, demonstrating improved retrieval capabilities and superior performance over vanilla Mamba. Finally, we conduct extensive ablation studies and empirical analyses to justify our design choices and provide evidence that our approach effectively mitigates the overallocation problem in Mamba-based models. Our code is publicly available.
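For intuition only, below is a minimal PyTorch sketch of the generic differential-design idea the abstract refers to: two parallel sequence-mixing branches whose outputs are subtracted with a learnable weight, so that attention mass both branches allocate to irrelevant context tends to cancel. The names used here (`NaiveDifferentialSSMBlock`, `mixer_a`, `mixer_b`, `lam`) are hypothetical; this is the kind of naive adaptation the paper argues is insufficient for Mamba, not the authors' actual Diff-Mamba mechanism.

```python
import torch
import torch.nn as nn


class NaiveDifferentialSSMBlock(nn.Module):
    """Illustrative sketch of the generic 'differential' design: two parallel
    sequence mixers whose outputs are subtracted with a learnable weight, so
    noise assigned to irrelevant context by both branches tends to cancel.
    Hypothetical example, not the paper's Diff-Mamba mechanism."""

    def __init__(self, d_model: int, mixer_a: nn.Module, mixer_b: nn.Module):
        super().__init__()
        self.mixer_a = mixer_a                      # first sequence mixer
        self.mixer_b = mixer_b                      # second sequence mixer
        self.lam = nn.Parameter(torch.tensor(0.5))  # learnable subtraction weight
        self.norm = nn.LayerNorm(d_model)           # stabilize the difference

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Subtract the two mixed streams; shared (noisy) components cancel.
        return self.norm(self.mixer_a(x) - self.lam * self.mixer_b(x))


if __name__ == "__main__":
    d = 64
    # Linear layers stand in for the two selective state-space branches.
    block = NaiveDifferentialSSMBlock(d, nn.Linear(d, d), nn.Linear(d, d))
    x = torch.randn(2, 16, d)        # (batch, sequence length, d_model)
    print(block(x).shape)            # torch.Size([2, 16, 64])
```

Per the abstract, the paper's contribution lies in the careful architectural modifications needed to make a differential mechanism work within Mamba's selective state-space layers, beyond the simple subtraction sketched above.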