Differential Mamba

July 8, 2025
Authors: Nadav Schneider, Itamar Zimerman, Eliya Nachmani
cs.AI

Abstract

Sequence models like Transformers and RNNs often overallocate attention to irrelevant context, leading to noisy intermediate representations. This degrades LLM capabilities by promoting hallucinations, weakening long-range and retrieval abilities, and reducing robustness. Recent work has shown that differential design can mitigate this issue in Transformers, improving their effectiveness across various applications. In this paper, we explore whether these techniques, originally developed for Transformers, can be applied to Mamba, a recent architecture based on selective state-space layers that achieves Transformer-level performance with greater efficiency. We show that a naive adaptation of differential design to Mamba is insufficient and requires careful architectural modifications. To address this, we introduce a novel differential mechanism for Mamba, empirically validated on language modeling benchmarks, demonstrating improved retrieval capabilities and superior performance over vanilla Mamba. Finally, we conduct extensive ablation studies and empirical analyses to justify our design choices and provide evidence that our approach effectively mitigates the overallocation problem in Mamba-based models. Our code is publicly available.
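To make the idea of a differential design concrete, the sketch below shows, in plain PyTorch, a toy block that subtracts the outputs of two parallel simplified selective state-space branches with a learned mixing weight, so noise shared by both branches can cancel. This is only a conceptual illustration under our own assumptions, not the paper's actual architecture: the simplified diagonal SSM, the subtraction scheme, and all names (`SimpleSelectiveSSM`, `DiffSSMBlock`, `lam`) are hypothetical.

```python
# Minimal conceptual sketch of a "differential" state-space block (assumptions, not the paper's method).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleSelectiveSSM(nn.Module):
    """Toy diagonal selective SSM: input-dependent step sizes and projections
    drive a per-channel recurrent state (a stand-in for a full Mamba layer)."""

    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.A_log = nn.Parameter(torch.randn(d_model, d_state) * 0.1)
        self.B_proj = nn.Linear(d_model, d_state)
        self.C_proj = nn.Linear(d_model, d_state)
        self.dt_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, d_model)
        b, l, d = x.shape
        dt = F.softplus(self.dt_proj(x))                   # (b, l, d) positive step sizes
        A = -torch.exp(self.A_log)                         # (d, n) stable decay rates
        B = self.B_proj(x)                                 # (b, l, n) input projection
        C = self.C_proj(x)                                 # (b, l, n) output projection
        h = x.new_zeros(b, d, A.shape[-1])                 # recurrent state (b, d, n)
        ys = []
        for t in range(l):                                 # sequential scan, for clarity only
            decay = torch.exp(dt[:, t].unsqueeze(-1) * A)  # (b, d, n)
            h = decay * h + dt[:, t].unsqueeze(-1) * B[:, t].unsqueeze(1) * x[:, t].unsqueeze(-1)
            ys.append((h * C[:, t].unsqueeze(1)).sum(-1))  # read out: (b, d)
        return torch.stack(ys, dim=1)                      # (b, l, d)


class DiffSSMBlock(nn.Module):
    """Differential design: two parallel SSM branches whose outputs are
    subtracted with a learnable weight, followed by a normalization."""

    def __init__(self, d_model: int):
        super().__init__()
        self.ssm_pos = SimpleSelectiveSSM(d_model)
        self.ssm_neg = SimpleSelectiveSSM(d_model)
        self.lam = nn.Parameter(torch.tensor(0.5))         # learnable mixing weight
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.ssm_pos(x) - self.lam * self.ssm_neg(x)
        return self.norm(y)


if __name__ == "__main__":
    block = DiffSSMBlock(d_model=32)
    out = block(torch.randn(2, 10, 32))
    print(out.shape)  # torch.Size([2, 10, 32])
```

As the abstract notes, a naive subtraction of this kind is not sufficient on its own; the paper's contribution lies in the careful architectural modifications needed to make the differential mechanism work within Mamba.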