Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory

Abstract

Increasing the size of a Transformer model does not always lead to better performance, a phenomenon that the empirical scaling laws cannot explain. Furthermore, generalization ability improves as the model memorizes the training samples. We present a theoretical framework that sheds light on the memorization process and performance dynamics of Transformer-based language models. We model the behavior of Transformers with associative memories using Hopfield networks, such that each Transformer block effectively conducts an approximate nearest-neighbor search. Based on this, we design an energy function analogous to that of the modern continuous Hopfield network, which provides an insightful explanation for the attention mechanism. Using the majorization-minimization technique, we construct a global energy function that captures the layered architecture of the Transformer. Under specific conditions, we show that the minimum achievable cross-entropy loss is bounded from below by a constant approximately equal to 1. We substantiate our theoretical results by conducting experiments with GPT-2 on various data sizes, as well as by training vanilla Transformers on a dataset of 2M tokens.
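To make the nearest-neighbor reading of attention concrete, the following is a minimal sketch of one retrieval step of a modern continuous Hopfield network: the query is replaced by a softmax-weighted average of the stored patterns. It is an illustration under stated assumptions, not the energy function or experimental setup of this work. The function names `attention_retrieve` and `unit`, the inverse temperature `beta`, the pattern count, and the dimension are all hypothetical choices made for the example.

```python
import math
import random


def attention_retrieve(patterns, query, beta=8.0):
    """One Hopfield-style retrieval step: softmax(beta * <pattern, query>) over stored patterns.

    Returns the retrieved vector (weighted average of patterns) and the softmax weights.
    """
    scores = [beta * sum(p_i * q_i for p_i, q_i in zip(p, query)) for p in patterns]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(query)
    retrieved = [sum(w * p[j] for w, p in zip(weights, patterns)) for j in range(dim)]
    return retrieved, weights


def unit(vector):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


# Assumed toy setup: 5 random unit-norm "memories" in 16 dimensions.
random.seed(0)
patterns = [unit([random.gauss(0, 1) for _ in range(16)]) for _ in range(5)]

# A noisy copy of one stored pattern plays the role of the query.
query = [x + 0.05 * random.gauss(0, 1) for x in patterns[2]]

retrieved, weights = attention_retrieve(patterns, query, beta=32.0)

# With unit-norm patterns, the largest inner product is also the nearest neighbor.
nearest = max(
    range(len(patterns)),
    key=lambda i: sum(a * b for a, b in zip(patterns[i], query)),
)
print("nearest stored pattern:", nearest)
print("softmax weights:", [round(w, 3) for w in weights])
```

In this sketch the largest softmax weight always falls on the stored pattern with the largest inner product with the query. A larger `beta` concentrates more of the weight on that pattern, so the step behaves like a hard nearest-neighbor lookup. A smaller `beta` blends several patterns together.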
