ATLAS: Learning to Optimally Memorize the Context at Test Time

Abstract

Transformers have become the most popular backbone in sequence modeling, mainly because of their effectiveness in in-context retrieval tasks and their ability to learn at scale. Their quadratic memory and time complexity, however, limits their applicability to longer sequences, which has motivated researchers to explore effective alternative architectures such as modern recurrent neural networks (a.k.a. long-term recurrent memory modules). Despite their recent success in diverse downstream tasks, these models struggle in tasks that require long-context understanding and extrapolation to longer sequences. We observe that these shortcomings come from three disjoint aspects of their design: (1) limited memory capacity, bounded by the architecture of the memory and the feature mapping of the input; (2) the online nature of the update, i.e., optimizing the memory only with respect to the last input; and (3) less expressive management of their fixed-size memory. To improve all three aspects, we present ATLAS, a high-capacity long-term memory module that learns to memorize the context by optimizing the memory based on the current and past tokens, overcoming the online nature of long-term memory models. Building on this insight, we present a new family of Transformer-like architectures, called DeepTransformers, that are strict generalizations of the original Transformer architecture. Our experimental results on language modeling, common-sense reasoning, recall-intensive, and long-context understanding tasks show that ATLAS surpasses the performance of Transformers and recent linear recurrent models. ATLAS further improves the long-context performance of Titans, achieving +80% accuracy at a 10M context length on the BABILong benchmark.
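To make aspect (2) concrete, the following minimal sketch contrasts an online memory update, which takes a gradient step on the latest key-value pair only, with an update that optimizes over a window of current and past pairs. It is not the ATLAS update rule: the linear matrix memory, the squared-error objective, plain gradient descent, and every name in the code (`online_step`, `window_step`, `run`, `recall_error`) are assumptions chosen for illustration.

```python
import numpy as np


def online_step(M, k, v, lr):
    """One gradient step on 0.5 * ||M @ k - v||^2 for the latest pair only."""
    err = M @ k - v
    return M - lr * np.outer(err, k)


def window_step(M, keys, values, lr):
    """One gradient step on the summed squared error over a window of pairs."""
    err = keys @ M.T - values          # shape: (window, d_v)
    return M - lr * err.T @ keys       # gradient has shape (d_v, d_k)


def run(keys, values, window, lr=0.1):
    """Stream the pairs through the memory, optimizing over the last `window` pairs."""
    M = np.zeros((values.shape[1], keys.shape[1]))
    for t in range(len(keys)):
        lo = max(0, t - window + 1)
        M = window_step(M, keys[lo:t + 1], values[lo:t + 1], lr)
    return M


def recall_error(M, keys, values):
    """Mean squared error when reading every stored value back from the memory."""
    return float(np.mean((keys @ M.T - values) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    keys = rng.normal(size=(32, 8))
    keys /= np.linalg.norm(keys, axis=1, keepdims=True)  # unit-norm keys (assumed)
    values = rng.normal(size=(32, 4))
    for w in (1, 4):
        M = run(keys, values, window=w)
        print(f"window={w}: recall error {recall_error(M, keys, values):.4f}")
```

With `window=1` the loop reduces to the purely online update; larger windows let each step account for past tokens as well, which is the direction the abstract describes.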
