TroL: Traversal of Layers for Large Language and Vision Models
June 18, 2024
Authors: Byung-Kwan Lee, Sangyun Chung, Chae Won Kim, Beomchan Park, Yong Man Ro
cs.AI
Abstract
Large language and vision models (LLVMs) have been driven by the generalization power of large language models (LLMs) and the advent of visual instruction tuning. Together with direct scaling, these advances enable LLVMs to deliver strong vision-language (VL) performance across diverse tasks specified by natural language instructions. However, existing open-source LLVMs that perform comparably to closed-source LLVMs such as GPT-4V are often considered too large (e.g., 26B, 34B, and 110B parameters) and have correspondingly many layers, so they demand costly, high-end resources for both training and inference. To address this issue, we present TroL (Traversal of Layers), a new efficient LLVM family with 1.8B, 3.8B, and 7B LLM model sizes that reuses layers in a token-wise manner. This layer-traversing technique simulates the effect of looking back and retracing the answering stream, increasing the number of forward-propagation layers without physically adding more layers. We demonstrate that, despite its simple layer-traversing approach, TroL efficiently outperforms open-source LLVMs with larger model sizes and rivals the performance of substantially larger closed-source LLVMs.
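The core idea, reusing an existing transformer layer a second time on a per-token basis instead of adding new layers, can be illustrated with a minimal sketch. The sketch below assumes a PyTorch-style layer that maps hidden states to hidden states; the module names (`TraversedLayer`, `TroLMixer`), the sigmoid gate, and the two-pass blending are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of token-wise layer traversal (layer reuse).
# Assumption: the wrapped layer maps hidden states of shape
# (batch, seq_len, hidden_size) to the same shape.
import torch
import torch.nn as nn


class TroLMixer(nn.Module):
    """Per-token gate that blends a one-pass and a two-pass layer output."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.gate = nn.Linear(hidden_size, 1)  # one scalar gate per token

    def forward(self, once: torch.Tensor, twice: torch.Tensor) -> torch.Tensor:
        # once, twice: (batch, seq_len, hidden_size)
        w = torch.sigmoid(self.gate(once))      # (batch, seq_len, 1)
        return (1.0 - w) * once + w * twice     # token-wise blend


class TraversedLayer(nn.Module):
    """Wraps an existing layer and traverses it twice, mixing per token."""

    def __init__(self, layer: nn.Module, hidden_size: int):
        super().__init__()
        self.layer = layer                      # the reused base layer
        self.mixer = TroLMixer(hidden_size)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        once = self.layer(hidden)               # ordinary forward pass
        twice = self.layer(once)                # traverse the same layer again
        return self.mixer(once, twice)          # no new full layers added


# Toy usage: an MLP stands in for a real transformer block.
if __name__ == "__main__":
    hidden_size = 64
    block = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.GELU(),
                          nn.Linear(hidden_size, hidden_size))
    traversed = TraversedLayer(block, hidden_size)
    x = torch.randn(2, 16, hidden_size)         # (batch, seq_len, hidden)
    print(traversed(x).shape)                   # torch.Size([2, 16, 64])
```

In a full model, such wrapped blocks would stand in for the backbone's transformer layers, so the effective forward depth grows while the parameter count stays essentially unchanged, since only a small per-layer gate is added.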