
Understanding and Mitigating Bottlenecks of State Space Models through the Lens of Recency and Over-smoothing

December 31, 2024
Authors: Peihao Wang, Ruisi Cai, Yuehao Wang, Jiajun Zhu, Pragya Srivastava, Zhangyang Wang, Pan Li
cs.AI

Abstract

Structured State Space Models (SSMs) have emerged as alternatives to transformers. While SSMs are often regarded as effective at capturing long-sequence dependencies, we rigorously demonstrate that they are inherently limited by a strong recency bias. Our empirical studies further reveal that this bias impairs the models' ability to recall distant information and introduces robustness issues. Our scaling experiments then show that deeper structures in SSMs can facilitate the learning of long contexts. However, subsequent theoretical analysis reveals that as SSMs grow deeper, they exhibit another inevitable tendency: over-smoothing, i.e., token representations becoming increasingly indistinguishable. This fundamental dilemma between recency and over-smoothing hinders the scalability of existing SSMs. Inspired by our theoretical findings, we propose to polarize two channels of the state transition matrices in SSMs, setting them to zero and one, respectively, thereby addressing recency bias and over-smoothing simultaneously. Experiments demonstrate that our polarization technique consistently enhances the associative recall accuracy of long-range tokens and enables SSMs to benefit further from deeper architectures. All source code is released at https://github.com/VITA-Group/SSM-Bottleneck.
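To make the polarization idea concrete, below is a minimal PyTorch sketch of a diagonal SSM recurrence h_t = a_t ⊙ h_{t-1} + B_t x_t in which one state channel's transition is pinned to zero and another's to one, as the abstract describes. The function name, tensor shapes, and sigmoid parameterization are illustrative assumptions for this sketch, not the authors' released implementation (see the linked repository for that).

```python
import torch

def polarized_scan(x, a_logits, B, C):
    """Illustrative polarized diagonal SSM scan (hypothetical names/shapes).

    x:        (batch, seq_len)      scalar input stream per batch element
    a_logits: (batch, seq_len, n)   pre-activation transition values
    B, C:     (batch, seq_len, n)   input / output projections
    """
    a = torch.sigmoid(a_logits).clone()  # transition values in (0, 1)
    a[..., 0] = 0.0    # "zero" channel: resets every step, counteracting over-smoothing
    a[..., -1] = 1.0   # "one" channel: never decays, counteracting recency bias

    h = a.new_zeros(a.shape[0], a.shape[-1])       # state: (batch, n)
    ys = []
    for t in range(x.shape[1]):
        h = a[:, t] * h + B[:, t] * x[:, t, None]  # h_t = a_t * h_{t-1} + B_t x_t
        ys.append((C[:, t] * h).sum(dim=-1))       # readout y_t = <C_t, h_t>
    return torch.stack(ys, dim=1)                  # (batch, seq_len)
```

For example, with `x = torch.randn(2, 16)` and `n = 8` state channels, the output has shape `(2, 16)`. In a real model this recurrence would form one layer's sequence-mixing step and would be computed with a parallel scan rather than a Python loop.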
