

LLaVA-OneVision: Easy Visual Task Transfer

August 6, 2024
作者: Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li
cs.AI

Abstract

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.
