Long-range Language Modeling with Self-retrieval

Abstract

Retrieval-augmented language models (LMs) have received much attention recently. However, the retriever is typically not trained jointly as a native component of the LM, but is added to an already-pretrained LM, which limits the ability of the LM and the retriever to adapt to one another. In this work, we propose the Retrieval-Pretrained Transformer (RPT), an architecture and training procedure for jointly training a retrieval-augmented LM from scratch for the task of modeling long texts. Given a recently generated text chunk in a long document, the LM computes query representations, which are then used to retrieve earlier chunks in the document, located potentially tens of thousands of tokens earlier. Information from the retrieved chunks is fused into the LM representations to predict the next target chunk. We train the retriever component with a semantic objective, where the goal is to retrieve chunks that increase the probability of the next chunk, according to a reference LM. We evaluate RPT on four long-range language modeling tasks, spanning books, code, and mathematical writing, and demonstrate that RPT improves retrieval quality and, consequently, perplexity across the board compared to strong baselines.
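The retrieval loop described in the abstract can be illustrated with a minimal sketch. This is not RPT's implementation: the bag-of-words `toy_embed` stands in for the learned query representations, and the names `split_into_chunks`, `retrieve_earlier_chunks`, `retrieval_gain`, and the `ref_logprob` callable are hypothetical, introduced only for illustration. The sketch assumes fixed-size chunks and dot-product scoring, neither of which the abstract specifies.

```python
from collections import Counter
from typing import Callable, List, Sequence


def split_into_chunks(tokens: Sequence[str], chunk_size: int) -> List[List[str]]:
    """Split a token sequence into consecutive fixed-size chunks."""
    return [list(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


def toy_embed(chunk: Sequence[str]) -> Counter:
    """Toy stand-in (assumption) for a learned representation: token counts."""
    return Counter(chunk)


def dot(a: Counter, b: Counter) -> float:
    """Dot product of two sparse count vectors."""
    return float(sum(count * b[token] for token, count in a.items()))


def retrieve_earlier_chunks(
    chunks: List[List[str]], query_index: int, top_k: int
) -> List[int]:
    """Return indices of the top_k chunks that precede query_index, best first."""
    query = toy_embed(chunks[query_index])
    scored = [(dot(query, toy_embed(chunks[i])), i) for i in range(query_index)]
    # Highest score first; ties broken by the earlier chunk index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored[:top_k]]


def retrieval_gain(
    ref_logprob: Callable[[List[str], List[str]], float],
    candidate: List[str],
    query: List[str],
    target: List[str],
) -> float:
    """Increase in the reference log-probability of the target chunk when the
    candidate chunk is added to the context. `ref_logprob(target, context)` is
    a hypothetical callable standing in for the reference LM."""
    return ref_logprob(target, candidate + query) - ref_logprob(target, query)
```

In this sketch, candidates with a positive `retrieval_gain` would be the ones a retriever is encouraged to rank highly, which mirrors the stated goal of retrieving chunks that increase the probability of the next chunk under a reference LM.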
