It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization

Abstract

Designing efficient and effective architectural backbones has been at the core of research efforts to enhance the capability of foundation models. Inspired by the human cognitive phenomenon of attentional bias (the natural tendency to prioritize certain events or stimuli), we reconceptualize neural architectures, including Transformers, Titans, and modern linear recurrent neural networks, as associative memory modules that learn a mapping of keys and values using an internal objective, referred to as attentional bias. Surprisingly, we observe that most existing sequence models leverage either (1) dot-product similarity or (2) L2 regression objectives as their attentional bias. Going beyond these objectives, we present a set of alternative attentional bias configurations along with effective approximations that stabilize their training. We then reinterpret forgetting mechanisms in modern deep learning architectures as a form of retention regularization, providing a novel set of forget gates for sequence models. Building upon these insights, we present Miras, a general framework for designing deep learning architectures based on four choices: (i) associative memory architecture, (ii) attentional bias objective, (iii) retention gate, and (iv) memory learning algorithm. We present three novel sequence models (Moneta, Yaad, and Memora) that go beyond the power of existing linear RNNs while maintaining a fast, parallelizable training process. Our experiments show that different design choices in Miras yield models with varying strengths. For example, certain instances of Miras achieve exceptional performance on special tasks such as language modeling, commonsense reasoning, and recall-intensive tasks, even outperforming Transformers and other modern linear recurrent models.
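To make the four design choices more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the simplest possible instance: a linear (matrix-valued) associative memory, an L2 regression objective as the attentional bias, a scalar decay factor standing in for the retention gate, and a single gradient-descent step as the memory learning algorithm. The names `recall`, `memory_step`, `lr`, and `retain` are hypothetical and do not come from the paper; Moneta, Yaad, and Memora use different choices that the abstract does not specify.

```python
def recall(memory, key):
    """Read from the memory: the matrix-vector product M @ key."""
    return [sum(m * k for m, k in zip(row, key)) for row in memory]


def memory_step(memory, key, value, lr=1.0, retain=1.0):
    """One online update: M <- retain * M - lr * grad.

    grad is the gradient of the assumed L2 objective
    0.5 * ||M @ key - value||^2 at the current M, i.e. outer(error, key).
    `retain` is an assumed scalar stand-in for a retention gate.
    """
    error = [p - v for p, v in zip(recall(memory, key), value)]
    return [
        [retain * m - lr * e * k for m, k in zip(row, key)]
        for row, e in zip(memory, error)
    ]


# Store one (key, value) association in an initially empty 2x2 memory.
memory = [[0.0, 0.0], [0.0, 0.0]]
memory = memory_step(memory, key=[1.0, 0.0], value=[2.0, 3.0])
stored = recall(memory, [1.0, 0.0])

# A later step with retain < 1 shrinks what was stored earlier.
decayed = memory_step(memory, key=[0.0, 1.0], value=[0.0, 0.0], retain=0.5)
faded = recall(decayed, [1.0, 0.0])
```

Under these assumptions, `stored` reproduces the written value for a unit-norm key, and `faded` shows the earlier association halved by the decay factor. Swapping the objective, the decay rule, or the optimizer in `memory_step` is one way to picture how different choices could lead to different sequence models.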
