

Video Instruction Tuning With Synthetic Data

October 3, 2024
Authors: Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li
cs.AI

Abstract

The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.
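
For concreteness, the sketch below shows what an instruction-following record for each of the three task types named in the abstract (detailed captioning, open-ended QA, multiple-choice QA) might look like. The field names (`video`, `instruction`, `response`) and example values are illustrative assumptions; the abstract does not specify the schema of the released dataset.

```python
# Illustrative instruction-following records for the three LLaVA-Video-178K
# task types. Field names and values are hypothetical, not the actual schema.

detailed_caption = {
    "video": "example_clip.mp4",
    "instruction": "Describe this video in detail.",
    "response": "A person enters a kitchen, opens the refrigerator, ...",
}

open_ended_qa = {
    "video": "example_clip.mp4",
    "instruction": "What does the person take out of the refrigerator?",
    "response": "A carton of milk.",
}

multiple_choice_qa = {
    "video": "example_clip.mp4",
    "instruction": "Which room is shown? (A) Kitchen (B) Garage (C) Office",
    "response": "(A) Kitchen",
}

# Each record pairs a video with a text instruction and a target response,
# the format a video LMM is fine-tuned on during instruction tuning.
for record in (detailed_caption, open_ended_qa, multiple_choice_qa):
    print(record["instruction"], "->", record["response"])
```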
