NVLM: Open Frontier-Class Multimodal LLMs

Abstract

We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model design, we perform a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g., Flamingo). Based on the strengths and weaknesses of both approaches, we propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for tile-based dynamic high-resolution images, which significantly boosts performance on multimodal reasoning and OCR-related tasks. Regarding training data, we meticulously curate and provide detailed information on our multimodal pretraining and supervised fine-tuning datasets. Our findings indicate that dataset quality and task diversity are more important than scale, even during the pretraining phase, across all architectures. Notably, we develop production-grade multimodality for the NVLM 1.0 models, enabling them to excel in vision-language tasks while maintaining and even improving text-only performance compared to their LLM backbones. To achieve this, we craft and integrate a high-quality text-only dataset into multimodal training, alongside a substantial amount of multimodal math and reasoning data, leading to enhanced math and coding capabilities across modalities. To advance research in the field, we are releasing the model weights and will open-source the code for the community: https://nvlm-project.github.io/.
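
The abstract names a 1-D tile-tagging design for tile-based dynamic high-resolution images but does not spell out its mechanics. The sketch below is one possible reading, not the authors' implementation: the image is split into a grid of tiles, and each tile's visual tokens are preceded by a text tag carrying a single running index rather than a (row, column) pair. The grid-selection heuristic, the `max_tiles` default, the tag strings, and the placeholder token strings are all assumptions made for illustration.

```python
from typing import List, Tuple


def choose_grid(width: int, height: int, max_tiles: int = 6) -> Tuple[int, int]:
    """Pick the (cols, rows) grid whose aspect ratio best matches the image.

    Hypothetical heuristic: among grids with at most `max_tiles` tiles,
    minimize the aspect-ratio gap; ties go to the grid with more tiles.
    """
    target = width / height
    candidates = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if cols * rows <= max_tiles
    ]
    return min(
        candidates,
        key=lambda g: (abs(g[0] / g[1] - target), -(g[0] * g[1])),
    )


def tagged_tile_sequence(width: int, height: int, max_tiles: int = 6) -> List[str]:
    """Build the tag/token layout for one image.

    Tag names are assumed for illustration. Bracketed strings stand in for
    the visual tokens an image encoder would produce for each tile.
    """
    cols, rows = choose_grid(width, height, max_tiles)
    # A downscaled view of the whole image comes first, under its own tag.
    sequence = ["<tile_global_thumbnail>", "[thumbnail tokens]"]
    # Tiles are numbered in row-major order with a single 1-D index.
    for index in range(1, cols * rows + 1):
        sequence.append(f"<tile_{index}>")
        sequence.append(f"[tile {index} tokens]")
    return sequence


if __name__ == "__main__":
    for item in tagged_tile_sequence(1000, 500):
        print(item)
```

Under these assumptions, a 1000x500 image maps to a 2x1 grid, so the sequence holds the thumbnail tag followed by `<tile_1>` and `<tile_2>`, each ahead of its tile's tokens. The tags are there so the language model can tell where one tile ends and the next begins in an otherwise flat token stream.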
