Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization

Abstract

We study whether transformers can learn to implicitly reason over parametric knowledge, a skill that even the most capable language models struggle with. Focusing on two representative reasoning types, composition and comparison, we consistently find that transformers can learn implicit reasoning, but only through grokking, i.e., extended training far beyond overfitting. The levels of generalization also vary across reasoning types: when faced with out-of-distribution examples, transformers fail to systematically generalize for composition but succeed for comparison. We delve into the model's internals throughout training, conducting analytical experiments that reveal: 1) the mechanism behind grokking, such as the formation of the generalizing circuit and its relation to the relative efficiency of generalizing and memorizing circuits, and 2) the connection between systematicity and the configuration of the generalizing circuit. Our findings guide data and training setup to better induce implicit reasoning and suggest potential improvements to the transformer architecture, such as encouraging cross-layer knowledge sharing. Furthermore, we demonstrate that for a challenging reasoning task with a large search space, GPT-4-Turbo and Gemini-1.5-Pro based on non-parametric memory fail badly regardless of prompting styles or retrieval augmentation, while a fully grokked transformer can achieve near-perfect accuracy, showcasing the power of parametric memory for complex reasoning.
