Transformers Can Achieve Length Generalization But Not Robustly

Abstract

Length generalization, defined as the ability to extrapolate from shorter training sequences to longer test sequences, is a significant challenge for language models. The problem persists even for large-scale Transformers handling relatively simple tasks. In this paper, we test the Transformer's length generalization ability using the task of adding two integers. We show that the success of length generalization is intricately linked to the data format and the type of position encoding. Using the right combination of data format and position encoding, we show for the first time that standard Transformers can extrapolate to a sequence length that is 2.5x the input length. Nevertheless, unlike in-distribution generalization, length generalization remains fragile: it is significantly influenced by factors such as random weight initialization and training data order, leading to large variance across random seeds.
