NVILA: Efficient Frontier Visual Language Models

Abstract

Visual language models (VLMs) have made significant advances in accuracy in recent years, but their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to optimize both efficiency and accuracy. Building on VILA, we improve its model architecture by first scaling up the spatial and temporal resolutions and then compressing the visual tokens. This "scale-then-compress" approach enables NVILA to process high-resolution images and long videos efficiently. We also conduct a systematic investigation into improving the efficiency of NVILA throughout its entire lifecycle, from training and fine-tuning to deployment. NVILA matches or surpasses the accuracy of many leading open and proprietary VLMs across a wide range of image and video benchmarks. At the same time, it reduces training costs by 4.5×, fine-tuning memory usage by 3.4×, pre-filling latency by 1.6-2.2×, and decoding latency by 1.2-2.8×. We will soon make our code and models available to facilitate reproducibility.
