When can transformers reason with abstract symbols?

Abstract

We investigate the capabilities of transformer large language models (LLMs) on relational reasoning tasks involving abstract symbols. Such tasks have long been studied in the neuroscience literature as fundamental building blocks for more complex abilities in programming, mathematics, and verbal reasoning. For (i) regression tasks, we prove that transformers generalize when trained, but require astonishingly large quantities of training data. For (ii) next-token-prediction tasks with symbolic labels, we show an "inverse scaling law": transformers fail to generalize as their embedding dimension increases. For both settings (i) and (ii), we propose subtle transformer modifications which can reduce the amount of data needed by adding two trainable parameters per head.
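The abstract does not describe a concrete task, so the following is only an illustrative sketch of what a relational reasoning task over abstract symbols might look like. It assumes a simple same/different task, where the label depends only on whether two symbols match, and it evaluates on symbols that never appear during training. The task choice, the function name `make_same_different_split`, and all parameters are hypothetical and are not taken from the paper.

```python
import random


def make_same_different_split(num_symbols=20, num_train_symbols=10,
                              pairs_per_split=8, seed=0):
    """Build a toy same/different dataset with disjoint train/test symbols.

    Hypothetical illustration only: the label is 1 when both symbols in a
    pair are identical and 0 otherwise, so the label depends on the relation
    between the symbols and not on their identity.
    """
    rng = random.Random(seed)

    # Abstract symbols are just opaque names; shuffle and split them so that
    # test-time symbols are never seen during training.
    symbols = [f"s{i}" for i in range(num_symbols)]
    rng.shuffle(symbols)
    train_vocab = symbols[:num_train_symbols]
    test_vocab = symbols[num_train_symbols:]

    def sample(vocab, n):
        examples = []
        for _ in range(n):
            if rng.random() < 0.5:
                a = rng.choice(vocab)
                examples.append(((a, a), 1))   # "same" pair
            else:
                a, b = rng.sample(vocab, 2)    # two distinct symbols
                examples.append(((a, b), 0))   # "different" pair
        return examples

    train = sample(train_vocab, pairs_per_split)
    test = sample(test_vocab, pairs_per_split)
    return train, test, set(train_vocab), set(test_vocab)
```

In this sketch, a model that generalizes must get the test pairs right even though every test symbol is new to it, which is the kind of out-of-vocabulary generalization that abstract-symbol reasoning calls for.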
