Long-range Language Modeling with Self-retrieval

Abstract

Retrieval-augmented language models (LMs) have received much attention recently. Typically, however, the retriever is not trained jointly as a native component of the LM but is added to an already-pretrained LM, which limits the ability of the LM and the retriever to adapt to one another. In this work, we propose the Retrieval-Pretrained Transformer (RPT), an architecture and training procedure for jointly training a retrieval-augmented LM from scratch for the task of modeling long texts. Given a recently generated text chunk in a long document, the LM computes query representations, which are then used to retrieve earlier chunks in the document, potentially located tens of thousands of tokens earlier. Information from the retrieved chunks is fused into the LM representations to predict the next target chunk. We train the retriever component with a semantic objective, where the goal is to retrieve chunks that increase the probability of the next chunk according to a reference LM. We evaluate RPT on four long-range language modeling tasks, spanning books, code, and mathematical writing, and demonstrate that RPT improves retrieval quality, and subsequently perplexity, across the board compared to strong baselines.
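
The retriever's training signal described above can be illustrated with a small sketch. It is not the RPT implementation: the function names (split_into_chunks, rank_earlier_chunks, toy_log_prob) are hypothetical, the smoothed unigram scorer is only a stand-in for a real reference LM, and measuring each candidate's gain against a no-retrieval baseline is an assumption that the abstract does not spell out. The sketch ranks the chunks that precede a query chunk by how much each one raises the scorer's log-probability of the next (target) chunk.

```python
import math
from typing import Callable, List, Sequence

Chunk = List[str]
# Signature of a reference scorer: (candidate, query, target) -> log-probability.
Scorer = Callable[[Sequence[str], Sequence[str], Sequence[str]], float]


def split_into_chunks(tokens: Sequence[str], chunk_size: int) -> List[Chunk]:
    """Split a token sequence into consecutive fixed-size chunks."""
    return [list(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


def toy_log_prob(candidate: Sequence[str], query: Sequence[str],
                 target: Sequence[str]) -> float:
    """Stand-in for a reference LM (assumption): add-one smoothed unigram
    log-probability of the target, estimated from candidate + query tokens."""
    context = list(candidate) + list(query)
    vocab = set(context) | set(target)
    return sum(
        math.log((context.count(tok) + 1) / (len(context) + len(vocab)))
        for tok in target
    )


def rank_earlier_chunks(chunks: List[Chunk], query_index: int,
                        target_log_prob: Scorer, top_k: int = 2) -> List[int]:
    """Return indices of up to top_k earlier chunks that raise the scorer's
    log-probability of the target chunk (the one right after the query)."""
    query = chunks[query_index]
    target = chunks[query_index + 1]  # requires a chunk after the query
    baseline = target_log_prob([], query, target)  # score without retrieval
    gains = []
    for i in range(query_index):  # only chunks that precede the query
        gain = target_log_prob(chunks[i], query, target) - baseline
        gains.append((gain, i))
    # Highest gain first; ties are broken by earlier position.
    gains.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for gain, i in gains[:top_k] if gain > 0]


chunks = split_into_chunks("a b c d x y q r x y".split(), chunk_size=2)
print(rank_earlier_chunks(chunks, query_index=3, target_log_prob=toy_log_prob))  # [2]
```

In this toy document, only the earlier chunk that shares tokens with the target raises its score, so it is the only one selected as a positive example. A real setup would replace toy_log_prob with log-probabilities from an actual reference LM.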
