Block-State Transformer

Abstract

State space models (SSMs) have shown impressive results on tasks that require modeling long-range dependencies, and they scale efficiently to long sequences owing to their subquadratic runtime complexity. Originally designed for continuous signals, SSMs have shown superior performance on a wide range of tasks in vision and audio; however, they still lag behind Transformers on language modeling tasks. In this work, we propose a hybrid layer, the Block-State Transformer (BST), that internally combines an SSM sublayer for long-range contextualization with a Block Transformer sublayer for short-term representation of sequences. We study three different, fully parallelizable variants that integrate SSMs and block-wise attention. We show that our model outperforms similar Transformer-based architectures on language modeling perplexity and generalizes to longer sequences. In addition, the Block-State Transformer demonstrates a more than tenfold increase in speed at the layer level compared to the Block-Recurrent Transformer when model parallelization is employed.
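
To make the idea of pairing an SSM sublayer with block-wise attention more concrete, the following is a minimal toy sketch in plain Python. It is an illustration under stated assumptions, not the architecture evaluated in this work: the function names (`ssm_scan`, `attend`, `block_state_layer`), the fixed scalar `decay`, the choice of a single context vector per block, and the absence of learned projections are all hypothetical simplifications, and the sketch does not reproduce any of the three variants studied. The SSM is also written as a sequential loop for readability, whereas a practical implementation would compute it in parallel.

```python
import math


def ssm_scan(xs, decay=0.9):
    """Toy diagonal linear recurrence: h_t = decay * h_{t-1} + (1 - decay) * x_t.

    Assumption: one fixed scalar decay shared by all channels, and the output
    is the hidden state itself. Written as a loop for clarity only.
    """
    h = [0.0] * len(xs[0])
    states = []
    for x in xs:
        h = [decay * hi + (1.0 - decay) * xi for hi, xi in zip(h, x)]
        states.append(h)
    return states


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def block_state_layer(xs, block_len=4, decay=0.9):
    """Hypothetical hybrid layer: SSM for long-range context, attention per block.

    Each token attends causally to tokens in its own block, plus one context
    vector: the SSM state at the last position of the previous block.
    """
    ssm_states = ssm_scan(xs, decay)
    out = []
    for start in range(0, len(xs), block_len):
        block = xs[start:start + block_len]
        # The first block has no preceding context in this sketch.
        context = [ssm_states[start - 1]] if start > 0 else []
        for i, query in enumerate(block):
            visible = context + block[: i + 1]  # causal mask within the block
            mixed = attend(query, visible, visible)
            out.append([x + m for x, m in zip(query, mixed)])  # residual add
    return out


# Usage: 12 tokens with 2 channels each, processed in blocks of 4.
tokens = [[float(i % 3), float((i * 2) % 5)] for i in range(12)]
result = block_state_layer(tokens, block_len=4)
print(len(result), len(result[0]))  # 12 2
```

In this sketch, attention cost grows with the block length rather than the full sequence length, while information from earlier blocks reaches later ones only through the SSM state, which is the division of labor the abstract describes.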