Transformers Can Represent n-gram Language Models
April 23, 2024
Authors: Anej Svete, Ryan Cotterell
cs.AI
Abstract
Plenty of existing work has analyzed the abilities of the transformer
architecture by describing its representational capacity with formal models of
computation. However, the focus so far has been on analyzing the architecture
in terms of language acceptance. We contend that this is an ill-suited
problem in the study of language models (LMs), which are definitionally
probability distributions over strings. In this paper, we focus on the
relationship between transformer LMs and n-gram LMs, a simple and
historically relevant class of language models. We show that transformer LMs
using the hard or sparse attention mechanisms can exactly represent any
n-gram LM, giving us a concrete lower bound on their probabilistic
representational capacity. This provides a first step towards understanding the
mechanisms that transformer LMs can use to represent probability distributions
over strings.
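As a rough illustration of the result stated in the abstract, the sketch below simulates the core idea in NumPy: hard (one-hot) attention heads that each copy the symbol at a fixed relative offset can expose the length-(n-1) context, and a lookup-style output layer can then emit the n-gram conditional distribution. This is a minimal sketch of the intuition, not the paper's construction; the trigram table and the names `hard_attention_head` and `transformer_next_distribution` are illustrative assumptions.

```python
# Minimal sketch (not the paper's construction): hard attention copying the
# last n-1 symbols, followed by a table lookup, reproduces an n-gram LM.
import numpy as np

vocab = ["<bos>", "a", "b"]
V = len(vocab)
idx = {s: i for i, s in enumerate(vocab)}

# An arbitrary trigram LM (n = 3): p(x_t | x_{t-2}, x_{t-1}) stored as a
# table of conditional distributions over the vocabulary.
rng = np.random.default_rng(0)
ngram_table = rng.dirichlet(np.ones(V), size=(V, V))  # shape (V, V, V)

def one_hot(i, d=V):
    v = np.zeros(d)
    v[i] = 1.0
    return v

def hard_attention_head(t, positions, values, offset):
    """Hard attention: the query at position t puts all of its weight on
    the single key at position t - offset (argmax of a score peaked there)."""
    scores = -np.abs(np.array(positions) - (t - offset))
    j = int(np.argmax(scores))  # one-hot attention weights
    return values[j]

def transformer_next_distribution(prefix):
    """Predict the next-symbol distribution by copying the two most recent
    symbols with two hard-attention heads (n - 1 = 2 for a trigram LM) and
    looking up the trigram table as the 'output layer'."""
    symbols = ["<bos>", "<bos>"] + list(prefix)   # pad the context
    positions = list(range(len(symbols)))
    values = [one_hot(idx[s]) for s in symbols]
    t = len(symbols) - 1
    h1 = hard_attention_head(t, positions, values, offset=1)  # copies x_{t-1}
    h2 = hard_attention_head(t, positions, values, offset=0)  # copies x_t
    return ngram_table[int(np.argmax(h1)), int(np.argmax(h2))]

# The simulated model matches the trigram conditionals exactly.
assert np.allclose(
    transformer_next_distribution(["a", "b"]),
    ngram_table[idx["a"], idx["b"]],
)
```

Under these assumptions, exactness comes from the attention weights being one-hot rather than soft, so no probability mass leaks to other positions; this is the sense in which the abstract's "hard or sparse attention" condition matters.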