Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision

Abstract

The performance of Large Vision Language Models (LVLMs) depends on the size and quality of their training datasets. Existing video instruction tuning datasets lack diversity because they are derived by prompting large language models with video captions to generate question-answer pairs, and they are therefore mostly descriptive. Meanwhile, many labeled video datasets with diverse labels and supervision exist; however, we find that integrating them into LVLMs is non-trivial. Herein, we present Video Self-Training with augmented Reasoning (Video-STaR), the first video self-training approach. Video-STaR allows any labeled video dataset to be used for video instruction tuning. In Video-STaR, an LVLM cycles between instruction generation and finetuning, which we show (I) improves general video understanding and (II) adapts LVLMs to novel downstream tasks with existing supervision. During generation, an LVLM is prompted to propose an answer. The answers are then filtered to keep only those that contain the original video labels, and the LVLM is re-trained on the resulting generated dataset. By training only on generated answers that contain the correct video labels, Video-STaR uses these existing video labels as weak supervision for video instruction tuning. Our results demonstrate that Video-STaR-enhanced LVLMs exhibit improved performance in (I) general video QA, where TempCompass performance improved by 10%, and (II) downstream tasks, where Video-STaR improved Kinetics700-QA accuracy by 20% and action quality assessment on FineDiving by 15%.
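The generate, filter, and re-train cycle described in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the `LabeledVideo` record, the `generate` and `finetune` callables, the `num_cycles` parameter, and the case-insensitive substring check in `contains_label` are all assumptions introduced here. The abstract states only that retained answers must contain the original video labels.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class LabeledVideo:
    """One item from a labeled video dataset (hypothetical structure)."""
    video_id: str  # hypothetical identifier for the clip
    question: str  # hypothetical instruction or question for the model
    label: str     # ground-truth label used as weak supervision


def contains_label(answer: str, label: str) -> bool:
    """Return True if the answer mentions the label.

    Assumption: a case-insensitive substring match stands in for
    whatever label verification the actual method uses.
    """
    return label.strip().lower() in answer.lower()


def self_training_cycle(
    examples: List[LabeledVideo],
    generate: Callable[[Any, LabeledVideo], str],
    finetune: Callable[[Any, List[Tuple[LabeledVideo, str]]], Any],
    model: Any,
    num_cycles: int = 2,
) -> Tuple[Any, List[Tuple[LabeledVideo, str]]]:
    """Alternate between answer generation and finetuning.

    `generate` and `finetune` are caller-supplied placeholders for
    LVLM inference and training; they are not real library calls.
    """
    kept: List[Tuple[LabeledVideo, str]] = []
    for _ in range(num_cycles):
        kept = []
        for example in examples:
            # Generation: the current model proposes an answer.
            answer = generate(model, example)
            # Filtering: keep only answers containing the video label.
            if contains_label(answer, example.label):
                kept.append((example, answer))
        # Re-training: finetune on the filtered, generated dataset.
        model = finetune(model, kept)
    return model, kept
```

In this sketch the labels never appear as training targets directly. They only decide which generated answers survive filtering, which is one way to read the abstract's description of labels as weak supervision.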
