LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
July 10, 2024
Authors: Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, Chunyuan Li
cs.AI
Abstract
Visual instruction tuning has made considerable strides in enhancing the capabilities of Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single-image tasks, and their applications to multi-image scenarios remain less explored. Additionally, prior LMM research tackles different scenarios separately, making it impossible to generalize across scenarios with emerging capabilities. To this end, we introduce LLaVA-NeXT-Interleave, which simultaneously tackles Multi-image, Multi-frame (video), Multi-view (3D), and Multi-patch (single-image) scenarios in LMMs. To enable these capabilities, we regard the interleaved data format as a general template and compile the M4-Instruct dataset with 1,177.6k samples, spanning 4 primary domains with 14 tasks and 41 datasets. We also curate the LLaVA-Interleave Bench to comprehensively evaluate the multi-image performance of LMMs. Through extensive experiments, LLaVA-NeXT-Interleave achieves leading results on multi-image, video, and 3D benchmarks, while maintaining performance on single-image tasks. Moreover, our model exhibits several emerging capabilities, e.g., transferring tasks across different settings and modalities. Code is available at https://github.com/LLaVA-VL/LLaVA-NeXT.