LLaVA-OneVision: Easy Visual Task Transfer

Abstract

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations from the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities and scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.
