Video Instruction Tuning With Synthetic Data

Abstract

The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach: creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset covers key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments show that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.
