Transformers Can Represent n-gram Language Models

Abstract

Plenty of existing work has analyzed the abilities of the transformer architecture by describing its representational capacity with formal models of computation. However, the focus so far has been on analyzing the architecture in terms of language acceptance. We contend that this is an ill-suited problem in the study of language models (LMs), which are definitionally probability distributions over strings. In this paper, we focus on the relationship between transformer LMs and n-gram LMs, a simple and historically relevant class of language models. We show that transformer LMs using the hard or sparse attention mechanisms can exactly represent any n-gram LM, giving us a concrete lower bound on their probabilistic representational capacity. This provides a first step towards understanding the mechanisms that transformer LMs can use to represent probability distributions over strings.
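To make the object of study concrete, the following is a minimal sketch of an n-gram LM viewed as a probability distribution over strings. It is an illustration only, not the construction from the paper: it assumes maximum-likelihood estimates from a toy corpus, and the names `train_ngram`, `next_symbol_prob`, `string_prob`, `BOS`, and `EOS` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical padding symbols marking the beginning and end of a string.
BOS, EOS = "<s>", "</s>"


def train_ngram(corpus, n):
    """Count how often each symbol follows each (n-1)-symbol context."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = [BOS] * (n - 1) + list(sentence) + [EOS]
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            counts[context][tokens[i]] += 1
    return counts


def next_symbol_prob(counts, context, symbol):
    """Maximum-likelihood estimate of p(symbol | context); 0 for unseen contexts."""
    seen = counts.get(context)
    if not seen:
        return 0.0
    return seen[symbol] / sum(seen.values())


def string_prob(counts, sentence, n):
    """Probability of a whole string: the product of the conditional
    probabilities of each symbol (and of EOS) given the previous n-1 symbols."""
    tokens = [BOS] * (n - 1) + list(sentence) + [EOS]
    prob = 1.0
    for i in range(n - 1, len(tokens)):
        context = tuple(tokens[i - n + 1:i])
        prob *= next_symbol_prob(counts, context, tokens[i])
    return prob
```

The key property is that each conditional probability depends only on the previous n-1 symbols, so the model assigns a probability to every string rather than merely accepting or rejecting it.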
