TroL: Traversal of Layers for Large Language and Vision Models
June 18, 2024
Authors: Byung-Kwan Lee, Sangyun Chung, Chae Won Kim, Beomchan Park, Yong Man Ro
cs.AI
Abstract
Large language and vision models (LLVMs) have been driven by the generalization power of large language models (LLMs) and the advent of visual instruction tuning. Along with direct scaling, visual instruction tuning enables LLVMs to achieve strong vision-language (VL) performance by covering diverse tasks through natural language instructions. However, existing open-source LLVMs that perform comparably to closed-source LLVMs such as GPT-4V are often considered too large (e.g., 26B, 34B, and 110B parameters), with a correspondingly large number of layers. These large models demand costly, high-end resources for both training and inference. To address this issue, we present a new efficient LLVM family with 1.8B, 3.8B, and 7B LLM model sizes, Traversal of Layers (TroL), which reuses layers in a token-wise manner. This layer-traversing technique simulates the effect of looking back at and retracing the answering stream while increasing the number of forward-propagation layers without physically adding more layers. We demonstrate that TroL, despite its simple layer-traversing approach, efficiently outperforms open-source LLVMs with larger model sizes and rivals the performance of substantially larger closed-source LLVMs.
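The abstract does not specify how the reused layer outputs are combined per token, so the following is only a minimal PyTorch-style sketch of the general idea: a learned per-token gate (the `mixer` name and structure here are illustrative assumptions, not the paper's API) blends a layer's first-pass output with a second pass through the same layer, doubling forward-propagation depth without adding layer parameters.

```python
import torch
import torch.nn as nn


class TraversalLayerSketch(nn.Module):
    """Illustrative wrapper that traverses one transformer block twice per token."""

    def __init__(self, base_layer: nn.Module, hidden_dim: int):
        super().__init__()
        self.base_layer = base_layer  # existing block mapping (B, T, D) -> (B, T, D)
        # Hypothetical token-wise mixer: one gate value in [0, 1] per token.
        self.mixer = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        first = self.base_layer(hidden_states)   # first pass through the layer
        second = self.base_layer(first)          # reuse the same weights ("traversal")
        gate = self.mixer(first)                 # (B, T, 1): per-token traversal gate
        # Each token decides how much of the retraced representation to keep;
        # forward depth effectively doubles with no extra layer parameters.
        return gate * second + (1.0 - gate) * first


# Toy usage: a feed-forward block stands in for a full transformer layer.
block = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))
layer = TraversalLayerSketch(block, hidden_dim=64)
out = layer(torch.randn(2, 16, 64))              # (batch=2, seq_len=16, dim=64)
```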