TART: A plug-and-play Transformer module for task-agnostic reasoning

June 13, 2023
Authors: Kush Bhatia, Avanika Narayan, Christopher De Sa, Christopher Ré
cs.AI

Abstract

Large language models (LLMs) exhibit in-context learning abilities which enable the same model to perform several tasks without any task-specific training. In contrast, traditional adaptation approaches, such as fine-tuning, modify the underlying models for each specific task. In-context learning, however, consistently underperforms task-specific tuning approaches even when presented with the same examples. While most existing approaches (e.g., prompt engineering) focus on the LLM's learned representations to patch this performance gap, our analysis actually reveals that LLM representations contain sufficient information to make good predictions. As such, we focus on the LLM's reasoning abilities and demonstrate that this performance gap exists due to their inability to perform simple probabilistic reasoning tasks. This raises an intriguing question: Are LLMs actually capable of learning how to reason in a task-agnostic manner? We answer this in the affirmative and propose TART which generically improves an LLM's reasoning abilities using a synthetically trained Transformer-based reasoning module. TART trains this reasoning module in a task-agnostic manner using only synthetic logistic regression tasks and composes it with an arbitrary real-world pre-trained model without any additional training. With a single inference module, TART improves performance across different model families (GPT-Neo, Pythia, BLOOM), model sizes (100M - 6B), tasks (14 NLP binary classification tasks), and even across different modalities (audio and vision). Additionally, on the RAFT Benchmark, TART improves GPT-Neo (125M)'s performance such that it outperforms BLOOM (176B), and is within 4% of GPT-3 (175B). Our code and models are available at https://github.com/HazyResearch/TART.
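The abstract only describes the recipe at a high level. Below is a minimal, hypothetical sketch (not the authors' released code) of what training a Transformer "reasoning module" purely on synthetic logistic regression tasks could look like in PyTorch; the names (sample_task, ReasoningModule) and all hyperparameters are illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch: a small causal Transformer trained only on synthetic
# logistic regression tasks to predict each label from the preceding
# (example, label) pairs in its context window.
import torch
import torch.nn as nn

def sample_task(n_points=32, dim=16):
    """One synthetic logistic regression task: random weights, Gaussian inputs,
    Bernoulli labels drawn from the logistic model."""
    w = torch.randn(dim)
    x = torch.randn(n_points, dim)
    y = torch.bernoulli(torch.sigmoid(x @ w))
    return x, y

class ReasoningModule(nn.Module):
    """Causal Transformer that predicts each point's label from the examples
    that precede it in the sequence (in-context learning)."""
    def __init__(self, dim=16, width=64, depth=2, heads=4):
        super().__init__()
        self.embed = nn.Linear(dim + 1, width)  # token = [x_t ; previous label]
        layer = nn.TransformerEncoderLayer(width, heads, 4 * width, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(width, 1)

    def forward(self, x, y):
        # Shift labels right so position t only sees labels of earlier examples.
        y_prev = torch.cat([torch.zeros_like(y[:, :1]), y[:, :-1]], dim=1)
        tokens = self.embed(torch.cat([x, y_prev.unsqueeze(-1)], dim=-1))
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.head(self.encoder(tokens, mask=mask)).squeeze(-1)  # logits

model = ReasoningModule()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(1000):  # task-agnostic training: only synthetic tasks are seen
    xs, ys = zip(*[sample_task() for _ in range(8)])
    x, y = torch.stack(xs), torch.stack(ys)
    loss = loss_fn(model(x, y), y)
    opt.zero_grad(); loss.backward(); opt.step()
```

Per the abstract, composition with a real pre-trained model requires no further training: one natural way to realize this is to replace the synthetic inputs x above with the frozen LLM's embeddings of the downstream task's examples (reduced to the module's input dimension, e.g., via PCA) and let the trained module produce the predictions.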