

Video Instruction Tuning With Synthetic Data

October 3, 2024
Authors: Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li
cs.AI

Abstract

The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.
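
For concreteness, the sketch below shows what an instruction-following record for each of the three task types named in the abstract (detailed captioning, open-ended QA, multiple-choice QA) might look like. The field names (`video`, `instruction`, `response`) and example values are illustrative assumptions; the abstract does not specify the schema of the released dataset.

```python
# Illustrative instruction-following records for the three LLaVA-Video-178K
# task types. Field names and values are hypothetical, not the actual schema.

detailed_caption = {
    "video": "example_clip.mp4",
    "instruction": "Describe this video in detail.",
    "response": "A person enters a kitchen, opens the refrigerator, ...",
}

open_ended_qa = {
    "video": "example_clip.mp4",
    "instruction": "What does the person take out of the refrigerator?",
    "response": "A carton of milk.",
}

multiple_choice_qa = {
    "video": "example_clip.mp4",
    "instruction": "Which room is shown? (A) Kitchen (B) Garage (C) Office",
    "response": "(A) Kitchen",
}

# Each record pairs a video with a text instruction and a target response,
# the format a video LMM is fine-tuned on during instruction tuning.
for record in (detailed_caption, open_ended_qa, multiple_choice_qa):
    print(record["instruction"], "->", record["response"])
```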
