Sessa: Selective State Space Attention

April 21, 2026
Author: Liubomyr Horbatko
cs.AI

Abstract

Modern sequence modeling is dominated by two families: Transformers, whose self-attention can access arbitrary elements of the visible sequence, and structured state-space models, which propagate information through an explicit recurrent state. These mechanisms face different limitations on long contexts: when attention is diffuse, the influence of individual tokens is diluted across the effective support, while recurrent state propagation can lose long-range sensitivity unless information is actively preserved. As a result, both mechanisms face challenges in preserving and selectively retrieving information over long contexts. We propose Sessa, a decoder that places attention inside a recurrent feedback path. This creates many attention-based paths through which past tokens can influence future states, rather than relying on a single attention read or a single recurrent chain. We prove that, under explicit assumptions and matched regimes, Sessa admits power-law memory tails O(ℓ^{-β}) for 0 < β < 1, with slower decay than in the corresponding Transformer and Mamba-style baselines. We further give an explicit construction that achieves this power-law rate. Under the same assumptions, Sessa is the only model class among those considered that realizes flexible selective retrieval, including profiles whose influence does not decay with distance. Consistent with this theoretical advantage, across matched experiments, Sessa achieves the strongest performance on long-context benchmarks while remaining competitive with Transformer and Mamba-style baselines on short-context language modeling.
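
To fix notation for the memory-tail claim (our phrasing, not the paper's): write ι(ℓ) = ‖∂s_T / ∂x_{T−ℓ}‖ for the influence of the token ℓ positions back on the current state. A contractive linear recurrence s_t = A s_{t−1} + B x_t with spectral radius ρ(A) < 1 forces ι(ℓ) = O(ρ^ℓ), i.e. geometric decay, while a single diffuse attention read spreads total weight 1 across its effective support, diluting any individual token. The abstract's claim is that Sessa instead admits ι(ℓ) = O(ℓ^{-β}) with 0 < β < 1, a strictly heavier tail than either.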
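To make the architectural claim concrete, the sketch below is a toy illustration, not the paper's implementation: ToySessaStyleCell, its update rule, and the 0.4/0.5 mixing coefficients are assumptions made for the demo. It feeds an attention read over past states back into the recurrent state, then empirically probes how a perturbation at distance ℓ from the end of the sequence survives into the final state, contrasted with a plain contractive recurrence.

```python
# Minimal NumPy sketch of the structural idea only: an attention read placed
# inside the recurrent feedback path, contrasted with a plain linear
# recurrence. All matrices, the mixing coefficients (0.4 / 0.5), and the
# perturbation probe are illustrative assumptions, not Sessa's construction.
import numpy as np

rng = np.random.default_rng(0)
d = 16  # state width (arbitrary)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class ToyLinearSSM:
    """s_t = A s_{t-1} + x_t: a token at distance ell reaches the final
    state only through A^ell, so its influence decays geometrically."""
    def __init__(self):
        self.A = 0.4 * np.eye(d)
    def step(self, s, x):
        return self.A @ s + x

class ToySessaStyleCell:
    """s_t = A s_{t-1} + 0.5 * Attn(W_q s_{t-1}, H_{<t}) + x_t, where H_{<t}
    stacks the past states. Because the attention read feeds back into the
    state, every past token gets many attention-mediated paths into every
    future state, not a single read or a single recurrent chain."""
    def __init__(self):
        self.A = 0.4 * np.eye(d)
        self.Wq = rng.normal(0.0, d ** -0.5, (d, d))
        self.hist = []  # H_{<t}: past states serve as keys and values
    def step(self, s, x):
        read = np.zeros(d)
        if self.hist:
            H = np.stack(self.hist)                      # (t, d)
            w = softmax(H @ (self.Wq @ s) / np.sqrt(d))  # (t,) attention weights
            read = w @ H                                 # read over past states
        s = self.A @ s + 0.5 * read + x
        self.hist.append(s)
        return s

def final_state(make_cell, xs):
    cell, s = make_cell(), np.zeros(d)
    for x in xs:
        s = cell.step(s, x)
    return s

# Empirical memory profile: perturb the token at distance ell from the end
# of the sequence and measure the change in the final state.
T = 64
xs = rng.normal(0.0, 1.0, (T, d))
for name, make_cell in [("linear SSM ", ToyLinearSSM),
                        ("Sessa-style", ToySessaStyleCell)]:
    base = final_state(make_cell, xs)
    for ell in (1, 8, 32, 63):
        xs_p = xs.copy()
        xs_p[T - 1 - ell] += 1.0
        diff = np.linalg.norm(final_state(make_cell, xs_p) - base)
        print(f"{name} distance {ell:2d}: |Δ final state| = {diff:.2e}")
```

Under this toy setup the contractive recurrence's response shrinks geometrically in ℓ, while the feedback attention keeps distant tokens influential; the paper's theoretical results are the rigorous version of this contrast, under its own stated assumptions.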