Transformers Can Represent n-gram Language Models

Abstract

Much existing work has analyzed the transformer architecture by describing its representational capacity in terms of formal models of computation. However, the focus so far has been on analyzing the architecture in terms of language acceptance. We contend that language acceptance is an ill-suited framing for the study of language models (LMs), which are by definition probability distributions over strings. In this paper, we focus on the relationship between transformer LMs and n-gram LMs, a simple and historically relevant class of language models. We show that transformer LMs using the hard or sparse attention mechanisms can exactly represent any n-gram LM, giving us a concrete lower bound on their probabilistic representational capacity. This provides a first step towards understanding the mechanisms that transformer LMs can use to represent probability distributions over strings.
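To make the phrase "probability distributions over strings" concrete, the following is a minimal sketch of a toy bigram (n = 2) LM. The alphabet, the `<s>` and `</s>` symbols, the probability table, and the function names are all hypothetical choices made for illustration; none of them come from the paper. The sketch only shows that an n-gram LM assigns each string a probability, rather than accepting or rejecting it, and that these probabilities add up towards 1.

```python
from itertools import product

# Hypothetical toy bigram LM over the alphabet {a, b}.
# BOS and EOS are assumed start and end-of-string symbols.
BOS, EOS = "<s>", "</s>"
PROBS = {
    BOS: {"a": 0.5, "b": 0.25, EOS: 0.25},
    "a": {"a": 0.25, "b": 0.25, EOS: 0.5},
    "b": {"a": 0.5, "b": 0.25, EOS: 0.25},
}


def string_probability(tokens):
    """Probability of a complete string under the toy bigram LM.

    Each factor conditions only on the previous token (n - 1 = 1 token
    of history); the final factor is the probability of ending the string.
    """
    prob = 1.0
    prev = BOS
    for tok in list(tokens) + [EOS]:
        prob *= PROBS[prev][tok]
        prev = tok
    return prob


def total_mass(max_len):
    """Sum of the probabilities of all strings of length <= max_len."""
    return sum(
        string_probability(s)
        for n in range(max_len + 1)
        for s in product("ab", repeat=n)
    )
```

For example, `string_probability("ab")` multiplies the three table entries for `a` after `<s>`, `b` after `a`, and `</s>` after `b`, and `total_mass(max_len)` approaches 1 as `max_len` grows, which is the sense in which the model is a distribution over strings rather than a recognizer of a language.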

