Sessa: Selective State Space Attention

Abstract

Modern sequence modeling is dominated by two families: Transformers, whose self-attention can access arbitrary elements of the visible sequence, and structured state-space models, which propagate information through an explicit recurrent state. Each mechanism has its own limitation on long contexts. When attention is diffuse, the influence of any individual token is diluted across the effective support; recurrent state propagation, in turn, can lose long-range sensitivity unless information is actively preserved. As a result, both struggle to preserve and selectively retrieve information over long contexts. We propose Sessa, a decoder that places attention inside a recurrent feedback path. This creates many attention-based paths through which past tokens can influence future states, rather than relying on a single attention read or a single recurrent chain. We prove that, under explicit assumptions and matched regimes, Sessa admits power-law memory tails $O(\ell^{-\beta})$ for $0 < \beta < 1$, which decay more slowly than those of the corresponding Transformer and Mamba-style baselines. We further give an explicit construction that achieves this power-law rate. Under the same assumptions, Sessa is the only model class among those considered that realizes flexible selective retrieval, including profiles whose influence does not decay with distance. Consistent with this theoretical advantage, in matched experiments Sessa achieves the strongest performance on long-context benchmarks while remaining competitive with Transformer and Mamba-style baselines on short-context language modeling.
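To give some intuition for why attention inside a feedback path can slow the decay of memory, the following is a minimal sketch under strong simplifying assumptions. It is not the Sessa architecture and not the construction used in the proofs. The toy recurrence s_t = x_t + gamma * mean(s_0, ..., s_{t-1}), the gain `gamma`, and the helper names `feedback_influence` and `log_log_slope` are all hypothetical and introduced here only for illustration. The sketch compares the influence of the first token under a single uniform (fully diffuse) attention read, which is 1/t at position t, with its influence when the same uniform attention is fed back through the state.

```python
import math


def feedback_influence(gamma, horizon):
    """Return h[t] = d s_t / d x_0 for t = 0..horizon.

    Toy recurrence (an illustrative assumption, not the paper's model):
        s_t = x_t + gamma * mean(s_0, ..., s_{t-1})
    i.e. uniform causal attention over past states, fed back with gain gamma.
    """
    h = [1.0]          # h[0] = 1: x_0 enters s_0 directly
    running_sum = 1.0  # sum of h[0..t-1]
    for t in range(1, horizon + 1):
        h_t = gamma * running_sum / t  # uniform weight 1/t on each past state
        h.append(h_t)
        running_sum += h_t
    return h


def log_log_slope(values, lo, hi):
    """Empirical decay exponent between positions lo and hi."""
    return (math.log(values[hi]) - math.log(values[lo])) / (
        math.log(hi) - math.log(lo)
    )


gamma = 0.5       # hypothetical feedback gain, chosen so that 1 - gamma = 0.5
horizon = 10_000

feedback = feedback_influence(gamma, horizon)

# Single uniform attention read: token 0 receives weight 1/t at position t.
single_read = [1.0] + [1.0 / t for t in range(1, horizon + 1)]

print("feedback slope:   ", log_log_slope(feedback, 100, horizon))
print("single-read slope:", log_log_slope(single_read, 100, horizon))
```

In this toy setting the single read decays with exponent -1, while the feedback recurrence decays roughly as t^{-(1 - gamma)}, a power law with exponent strictly between 0 and 1 whenever 0 < gamma < 1. The many feedback paths accumulate the influence that one diffuse read would dilute. Whether and how this carries over to the actual model depends on the assumptions stated in the paper.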
