TroL: Traversal of Layers for Large Language and Vision Models

Abstract

Large language and vision models (LLVMs) have been driven by the generalization power of large language models (LLMs) and the advent of visual instruction tuning. Beyond being scaled up directly, LLVMs achieve strong vision-language (VL) performance by covering diverse tasks through natural language instructions. However, existing open-source LLVMs that perform comparably to closed-source LLVMs such as GPT-4V are often considered too large (e.g., 26B, 34B, and 110B parameters) and have a larger number of layers. These large models demand costly, high-end resources for both training and inference. To address this issue, we present Traversal of Layers (TroL), a new efficient LLVM family with 1.8B, 3.8B, and 7B LLM model sizes that enables the reuse of layers in a token-wise manner. This layer traversing technique simulates the effect of looking back and retracing the answering stream while increasing the number of forward propagation layers without physically adding more layers. We demonstrate that TroL, despite employing a simple layer traversing approach, efficiently outperforms open-source LLVMs with larger model sizes and rivals the performance of closed-source LLVMs of substantial size.
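To make the idea of token-wise layer reuse concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify how the reused pass is combined with the first pass, so everything here is an assumption made for illustration: the toy layer (`make_toy_layer`), the sigmoid gate (`token_gate`), and the per-token blend in `traverse_layer` are hypothetical names and choices. The only point the sketch is meant to show is that the number of forward passes through layers can grow while the set of layer weights stays the same.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]
Layer = Callable[[List[Vector]], List[Vector]]


def make_toy_layer(scale: float, shift: float) -> Layer:
    """Hypothetical stand-in for a transformer layer: a per-token tanh map."""
    def layer(tokens: List[Vector]) -> List[Vector]:
        return [[math.tanh(scale * v + shift) for v in tok] for tok in tokens]
    return layer


def token_gate(tok: Vector, gate_weights: Vector) -> float:
    """Assumed per-token gate in (0, 1): sigmoid of a dot product."""
    score = sum(w * v for w, v in zip(gate_weights, tok))
    return 1.0 / (1.0 + math.exp(-score))


def traverse_layer(layer: Layer, tokens: List[Vector],
                   gate_weights: Vector) -> List[Vector]:
    """Run the same layer twice and blend both outputs token by token."""
    once = layer(tokens)
    twice = layer(once)  # second pass reuses the same weights
    mixed = []
    for t1, t2 in zip(once, twice):
        g = token_gate(t1, gate_weights)  # one gate value per token
        mixed.append([g * b + (1.0 - g) * a for a, b in zip(t1, t2)])
    return mixed


def forward(layers: List[Layer], tokens: List[Vector], gate_weights: Vector,
            reuse: bool = True) -> Tuple[List[Vector], int]:
    """Return the final hidden states and the number of layer passes used."""
    calls = 0
    for layer in layers:
        if reuse:
            tokens = traverse_layer(layer, tokens, gate_weights)
            calls += 2
        else:
            tokens = layer(tokens)
            calls += 1
    return tokens, calls


if __name__ == "__main__":
    layers = [make_toy_layer(1.0, 0.0), make_toy_layer(0.5, 0.1)]
    tokens = [[0.1, -0.2], [0.3, 0.4]]
    _, plain_calls = forward(layers, tokens, [0.5, -0.5], reuse=False)
    _, reuse_calls = forward(layers, tokens, [0.5, -0.5], reuse=True)
    print(plain_calls, reuse_calls)  # 2 passes without reuse, 4 with reuse
```

In this sketch the two physical layers are traversed four times in total, so the effective depth doubles with no additional layer weights beyond the small assumed gate vector.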
