TART: A plug-and-play Transformer module for task-agnostic reasoning

June 13, 2023
Authors: Kush Bhatia, Avanika Narayan, Christopher De Sa, Christopher Ré
cs.AI

Abstract

Large language models (LLMs) exhibit in-context learning abilities which enable the same model to perform several tasks without any task-specific training. In contrast, traditional adaptation approaches, such as fine-tuning, modify the underlying models for each specific task. In-context learning, however, consistently underperforms task-specific tuning approaches even when presented with the same examples. While most existing approaches (e.g., prompt engineering) focus on the LLM's learned representations to patch this performance gap, our analysis actually reveals that LLM representations contain sufficient information to make good predictions. As such, we focus on the LLM's reasoning abilities and demonstrate that this performance gap exists due to their inability to perform simple probabilistic reasoning tasks. This raises an intriguing question: Are LLMs actually capable of learning how to reason in a task-agnostic manner? We answer this in the affirmative and propose TART, which generically improves an LLM's reasoning abilities using a synthetically trained Transformer-based reasoning module. TART trains this reasoning module in a task-agnostic manner using only synthetic logistic regression tasks and composes it with an arbitrary real-world pre-trained model without any additional training. With a single inference module, TART improves performance across different model families (GPT-Neo, Pythia, BLOOM), model sizes (100M - 6B), tasks (14 NLP binary classification tasks), and even across different modalities (audio and vision). Additionally, on the RAFT Benchmark, TART improves GPT-Neo (125M)'s performance such that it outperforms BLOOM (176B), and is within 4% of GPT-3 (175B). Our code and models are available at https://github.com/HazyResearch/TART.
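
The abstract describes the recipe only at a high level: train a small Transformer reasoning module purely on synthetic logistic regression tasks, then compose it, unchanged, with the frozen representations of an arbitrary pre-trained model. The sketch below is a minimal, hypothetical PyTorch illustration of that idea, not the authors' implementation (see https://github.com/HazyResearch/TART for the official code); the `sample_logistic_regression_task` and `ReasoningModule` names, the hyperparameters, and the label-shifting scheme are assumptions made purely for illustration.

```python
# Minimal, illustrative sketch of the TART recipe. NOT the authors' implementation
# (official code: https://github.com/HazyResearch/TART). Names, hyperparameters,
# and the label-shifting scheme below are assumptions for illustration only.

import torch
import torch.nn as nn


def sample_logistic_regression_task(n_examples=32, dim=16):
    """Sample one synthetic in-context task: inputs x and labels y drawn from a
    random logistic-regression model."""
    w = torch.randn(dim)
    x = torch.randn(n_examples, dim)
    y = torch.bernoulli(torch.sigmoid(x @ w))
    return x, y


class ReasoningModule(nn.Module):
    """A small causal Transformer that reads (x_i, previous label) tokens and
    predicts each label from the examples that precede it (in-context learning)."""

    def __init__(self, dim=16, d_model=128, n_layers=4, n_heads=4):
        super().__init__()
        self.in_proj = nn.Linear(dim + 1, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, 1)

    def forward(self, x, y):
        # Shift labels so position i only sees the labels of examples < i.
        y_prev = torch.cat([torch.zeros(1), y[:-1]]).unsqueeze(-1)
        tokens = self.in_proj(torch.cat([x, y_prev], dim=-1)).unsqueeze(0)
        n = x.size(0)
        causal_mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        h = self.encoder(tokens, mask=causal_mask)
        return self.out(h).squeeze(0).squeeze(-1)  # one logit per in-context example


# Task-agnostic training: the module only ever sees synthetic tasks.
module = ReasoningModule()
opt = torch.optim.Adam(module.parameters(), lr=1e-4)
for step in range(1000):
    x, y = sample_logistic_regression_task()
    loss = nn.functional.binary_cross_entropy_with_logits(module(x, y), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Composition (no further training): at inference time, real examples would be
# embedded with any frozen pre-trained backbone (e.g., GPT-Neo hidden states
# reduced to `dim` dimensions), and the trained module would predict the query
# label in context from those embeddings. The embedding step is omitted here.
```

The design point the abstract emphasizes is that the same trained reasoning module is reused without modification across model families, sizes, and even modalities; only the upstream embedding step changes.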