Transformers Can Represent n-gram Language Models
April 23, 2024
Authors: Anej Svete, Ryan Cotterell
cs.AI
Abstract
Plenty of existing work has analyzed the abilities of the transformer architecture by describing its representational capacity with formal models of computation. However, the focus so far has been on analyzing the architecture in terms of language acceptance. We contend that this is an ill-suited problem in the study of language models (LMs), which are definitionally probability distributions over strings. In this paper, we focus on the relationship between transformer LMs and n-gram LMs, a simple and historically relevant class of language models. We show that transformer LMs using the hard or sparse attention mechanisms can exactly represent any n-gram LM, giving us a concrete lower bound on their probabilistic representational capacity. This provides a first step towards understanding the mechanisms that transformer LMs can use to represent probability distributions over strings.
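
As a quick gloss (our illustration, assuming the standard definition rather than the paper's formal development): an n-gram LM assigns probability to a string y_1 ... y_T autoregressively, conditioning each symbol only on the preceding n-1 symbols,

p(y_1 \cdots y_T) = \prod_{t=1}^{T} p\left(y_t \mid y_{t-n+1} \cdots y_{t-1}\right),

with the context padded by beginning-of-string symbols when t < n. The result stated in the abstract is that a transformer LM with hard or sparse attention can match every conditional distribution of this form exactly.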