Video Instruction Tuning With Synthetic Data
October 3, 2024
Authors: Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li
cs.AI
Abstract
The development of video large multimodal models (LMMs) has been hindered by
the difficulty of curating large amounts of high-quality raw data from the web.
To address this, we propose an alternative approach by creating a high-quality
synthetic dataset specifically for video instruction-following, namely
LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning,
open-ended question-answering (QA), and multiple-choice QA. By training on this
dataset, in combination with existing visual instruction tuning data, we
introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that
LLaVA-Video achieves strong performance across various video benchmarks,
highlighting the effectiveness of our dataset. We plan to release the dataset,
its generation pipeline, and the model checkpoints.
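The abstract names three instruction-following task types covered by LLaVA-Video-178K: detailed captioning, open-ended QA, and multiple-choice QA. As a rough sketch of what one training sample per task could look like (field names and values here are hypothetical illustrations, not the released dataset schema):

```python
# Hypothetical illustration of the three task types described in the abstract.
# Field names and contents are assumptions for clarity, not the actual schema.
detailed_caption = {
    "task": "detailed_captioning",
    "video": "example_clip.mp4",
    "question": "Describe the video in detail.",
    "answer": "A person walks into a kitchen and begins chopping vegetables...",
}

open_ended_qa = {
    "task": "open_ended_qa",
    "video": "example_clip.mp4",
    "question": "What dish is the person preparing?",
    "answer": "A vegetable stir-fry.",
}

multiple_choice_qa = {
    "task": "multiple_choice_qa",
    "video": "example_clip.mp4",
    "question": "What utensil does the person use first?",
    "options": ["A. Knife", "B. Spoon", "C. Whisk", "D. Tongs"],
    "answer": "A",  # multiple-choice answers reduce to a single option letter
}

samples = [detailed_caption, open_ended_qa, multiple_choice_qa]
for s in samples:
    print(s["task"], "->", s["question"])
```

Framing all three tasks in one question/answer format is what lets a single dataset serve instruction tuning: the model always maps (video, question) to an answer, whether that answer is a long caption, free-form text, or an option letter.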