On Subquadratic Architectures: From Applications to Principles

Abstract

Transformers dominate modern sequence modeling, but their quadratic attention incurs substantial computational cost. Subquadratic architectures offer a scalable alternative. However, it remains unclear which designs yield the most effective sequence models. We compare three leading approaches: xLSTM, Mamba-2, and Gated DeltaNet. We evaluate these models on tasks with complex dependencies: (1) code-model pre-training, (2) distillation of code models from large language models, and (3) pre-training of time-series foundation models. Across these settings, xLSTM delivers the strongest overall performance. To explain xLSTM's advantage, we present a unified formulation and analyze the underlying architectural mechanisms, focusing on state tracking and memory dynamics. Our results show that xLSTM enables more flexible and stable memory correction via its gating scheme. We corroborate these findings on controlled synthetic length-generalization tasks. Overall, our findings indicate that xLSTM's gains on complex tasks stem from robust state tracking and accumulation.