OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning

Abstract

This paper simplifies OpenVision's architecture and loss design to improve its training efficiency. Following the prior vision-language pretraining works CapPa and AIMv2, as well as modern multimodal designs such as LLaVA, our changes are straightforward: we remove the text encoder (and with it the contrastive loss), retaining only the captioning loss as a purely generative training signal. We name this new version OpenVision 2. The initial results are promising: despite this simplification, OpenVision 2 matches the original model's performance on a broad set of multimodal benchmarks while substantially cutting both training time and memory consumption. For example, with ViT-L/14, it reduces training time by about 1.5x (from 83h to 57h) and memory usage by about 1.8x (from 24.5GB to 13.8GB), allowing the maximum batch size to grow from 2k to 8k. This training efficiency also lets us scale far beyond the largest vision encoder used in OpenVision, reaching more than 1 billion parameters. We strongly believe that this lightweight, generative-only paradigm is compelling for future vision encoder development in multimodal foundation models.
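The ViT-L/14 efficiency figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below assumes that "reduces by about N x" means the ratio of the old cost to the new one. The helper name `reduction` is hypothetical and is not part of OpenVision 2.

```python
def reduction(before: float, after: float) -> tuple[float, float]:
    """Return (before/after ratio, fraction saved) for a cost that dropped."""
    return before / after, 1.0 - after / before


# Figures taken from the abstract (ViT-L/14).
time_ratio, time_saved = reduction(83.0, 57.0)  # training time in hours
mem_ratio, mem_saved = reduction(24.5, 13.8)    # memory usage in GB

print(f"training time: {time_ratio:.2f}x, {time_saved:.0%} saved")
print(f"memory usage:  {mem_ratio:.2f}x, {mem_saved:.0%} saved")
```

Under this reading, the training-time ratio is about 1.46x (roughly 31% less time) and the memory ratio is about 1.78x (roughly 44% less memory), consistent with the rounded "about 1.5x" and "about 1.8x" in the text.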
