Linear Transformers are Versatile In-Context Learners

Abstract

Recent research has demonstrated that transformers, particularly linear attention models, implicitly execute gradient-descent-like algorithms on data provided in-context during their forward inference step. However, their capability in handling more complex problems remains unexplored. In this paper, we prove that any linear transformer maintains an implicit linear model and can be interpreted as performing a variant of preconditioned gradient descent. We also investigate the use of linear transformers in a challenging scenario where the training data is corrupted with different levels of noise. Remarkably, we demonstrate that for this problem linear transformers discover an intricate and highly effective optimization algorithm, surpassing or matching in performance many reasonable baselines. We reverse-engineer this algorithm and show that it is a novel approach incorporating momentum and adaptive rescaling based on noise levels. Our findings show that even linear transformers possess the surprising ability to discover sophisticated optimization strategies.
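As a rough illustration of the setting described above, the following sketch runs preconditioned gradient descent on a small noisy linear regression task of the kind a model would see in-context. It is not the algorithm the paper reverse-engineers, and it contains no transformer. The function names (`make_task`, `half_mse`, `preconditioned_gd`), the diagonal preconditioner, and all numeric settings are assumptions chosen for illustration only.

```python
import random


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def make_task(n, d, noise_std, rng):
    """Sample one regression task: y_i = <w, x_i> + Gaussian noise (hypothetical setup)."""
    w = [rng.gauss(0.0, 1.0) for _ in range(d)]
    xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    ys = [dot(w, x) + rng.gauss(0.0, noise_std) for x in xs]
    return w, xs, ys


def half_mse(w, xs, ys):
    """Objective L(w) = 1/(2n) * sum_i (<w, x_i> - y_i)^2."""
    return sum((dot(w, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))


def preconditioned_gd(xs, ys, steps, lr, precond):
    """Iterate w <- w - lr * P * grad L(w), with P = diag(precond), starting from w = 0."""
    n, d = len(xs), len(xs[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for x, y in zip(xs, ys):
            err = dot(w, x) - y  # residual of the current linear model on one example
            for j in range(d):
                grad[j] += err * x[j] / n
        # The diagonal preconditioner is an assumption; identity recovers plain GD.
        w = [wj - lr * pj * gj for wj, pj, gj in zip(w, precond, grad)]
    return w


rng = random.Random(0)
w_true, xs, ys = make_task(n=50, d=3, noise_std=0.5, rng=rng)
w_hat = preconditioned_gd(xs, ys, steps=200, lr=0.1, precond=[1.0, 1.0, 1.0])
print("loss at w = 0:", half_mse([0.0] * 3, xs, ys))
print("loss after GD:", half_mse(w_hat, xs, ys))
```

The abstract's claim is that each linear transformer layer can be read as one such update on an implicit linear model, with the learned weights playing the role of the step size and preconditioner. In this sketch those quantities are fixed by hand rather than learned.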
