LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
July 10, 2024
作者: Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, Chunyuan Li
cs.AI
Abstract
Visual instruction tuning has made considerable strides in enhancing the
capabilities of Large Multimodal Models (LMMs). However, existing open LMMs
largely focus on single-image tasks; their applications to multi-image
scenarios remain less explored. Additionally, prior LMM research tackles
different scenarios separately, making it impossible to generalize across
scenarios with new emerging capabilities. To this end, we introduce
LLaVA-NeXT-Interleave, which simultaneously tackles Multi-image, Multi-frame
(video), Multi-view (3D), and Multi-patch (single-image) scenarios in LMMs. To
enable these capabilities, we regard the interleaved data format as a general
template and compile the M4-Instruct dataset with 1,177.6k samples, spanning 4
primary domains with 14 tasks and 41 datasets. We also curate the
LLaVA-Interleave Bench to comprehensively evaluate the multi-image performance
of LMMs. Through extensive experiments, LLaVA-NeXT-Interleave achieves leading
results in multi-image, video, and 3D benchmarks, while maintaining the
performance of single-image tasks. In addition, our model exhibits several
emerging capabilities, e.g., transferring tasks across different settings and
modalities. Code is available at https://github.com/LLaVA-VL/LLaVA-NeXT