Transformers Can Represent n-gram Language Models
April 23, 2024
Authors: Anej Svete, Ryan Cotterell
cs.AI
Abstract
Plenty of existing work has analyzed the abilities of the transformer
architecture by describing its representational capacity with formal models of
computation. However, the focus so far has been on analyzing the architecture
in terms of language acceptance. We contend that this is an ill-suited
problem in the study of language models (LMs), which are definitionally
probability distributions over strings. In this paper, we focus on the
relationship between transformer LMs and n-gram LMs, a simple and
historically relevant class of language models. We show that transformer LMs
using the hard or sparse attention mechanisms can exactly represent any
n-gram LM, giving us a concrete lower bound on their probabilistic
representational capacity. This provides a first step towards understanding the
mechanisms that transformer LMs can use to represent probability distributions
over strings.
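As a rough illustration of the result stated in the abstract, the sketch below simulates the core idea in NumPy: hard (one-hot) attention heads that each copy the symbol at a fixed relative offset can expose the length-(n-1) context, and a lookup-style output layer can then emit the n-gram conditional distribution. This is a minimal sketch of the intuition, not the paper's construction; the trigram table and the names `hard_attention_head` and `transformer_next_distribution` are illustrative assumptions.

```python
# Minimal sketch (not the paper's construction): hard attention copying the
# last n-1 symbols, followed by a table lookup, reproduces an n-gram LM.
import numpy as np

vocab = ["<bos>", "a", "b"]
V = len(vocab)
idx = {s: i for i, s in enumerate(vocab)}

# An arbitrary trigram LM (n = 3): p(x_t | x_{t-2}, x_{t-1}) stored as a
# table of conditional distributions over the vocabulary.
rng = np.random.default_rng(0)
ngram_table = rng.dirichlet(np.ones(V), size=(V, V))  # shape (V, V, V)

def one_hot(i, d=V):
    v = np.zeros(d)
    v[i] = 1.0
    return v

def hard_attention_head(t, positions, values, offset):
    """Hard attention: the query at position t puts all of its weight on
    the single key at position t - offset (argmax of a score peaked there)."""
    scores = -np.abs(np.array(positions) - (t - offset))
    j = int(np.argmax(scores))  # one-hot attention weights
    return values[j]

def transformer_next_distribution(prefix):
    """Predict the next-symbol distribution by copying the two most recent
    symbols with two hard-attention heads (n - 1 = 2 for a trigram LM) and
    looking up the trigram table as the 'output layer'."""
    symbols = ["<bos>", "<bos>"] + list(prefix)   # pad the context
    positions = list(range(len(symbols)))
    values = [one_hot(idx[s]) for s in symbols]
    t = len(symbols) - 1
    h1 = hard_attention_head(t, positions, values, offset=1)  # copies x_{t-1}
    h2 = hard_attention_head(t, positions, values, offset=0)  # copies x_t
    return ngram_table[int(np.argmax(h1)), int(np.argmax(h2))]

# The simulated model matches the trigram conditionals exactly.
assert np.allclose(
    transformer_next_distribution(["a", "b"]),
    ngram_table[idx["a"], idx["b"]],
)
```

Under these assumptions, exactness comes from the attention weights being one-hot rather than soft, so no probability mass leaks to other positions; this is the sense in which the abstract's "hard or sparse attention" condition matters.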