
LLaVA-OneVision: Easy Visual Task Transfer

August 6, 2024
作者: Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li
cs.AI

Abstract

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.