
Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision

July 8, 2024
Authors: Orr Zohar, Xiaohan Wang, Yonatan Bitton, Idan Szpektor, Serena Yeung-Levy
cs.AI

Abstract

The performance of Large Vision Language Models (LVLMs) depends on the size and quality of their training datasets. Existing video instruction tuning datasets lack diversity, as they are derived by prompting large language models with video captions to generate question-answer pairs, and are therefore mostly descriptive. Meanwhile, many labeled video datasets with diverse labels and supervision exist; however, we find that their integration into LVLMs is non-trivial. Herein, we present Video Self-Training with augmented Reasoning (Video-STaR), the first video self-training approach. Video-STaR allows the utilization of any labeled video dataset for video instruction tuning. In Video-STaR, an LVLM cycles between instruction generation and finetuning, which we show (I) improves general video understanding and (II) adapts LVLMs to novel downstream tasks with existing supervision. During generation, an LVLM is prompted to propose an answer. The answers are then filtered to retain only those that contain the original video labels, and the LVLM is re-trained on the generated dataset. By training only on generated answers that contain the correct video labels, Video-STaR uses these existing video labels as weak supervision for video instruction tuning. Our results demonstrate that Video-STaR-enhanced LVLMs exhibit improved performance on (I) general video QA, where TempCompass performance improved by 10%, and (II) downstream tasks, where Video-STaR improved Kinetics700-QA accuracy by 20% and action quality assessment on FineDiving by 15%.
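
To make the generate-filter-finetune cycle concrete, below is a minimal Python sketch of the loop the abstract describes. All names here (`lvlm_generate`, `finetune`, `Example`, `label_in_answer`) are hypothetical placeholders, and the simple substring check stands in for the paper's label-verification step; the authors' actual implementation may differ.

```python
# A minimal sketch of the Video-STaR self-training cycle, assuming
# hypothetical model helpers. This is not the authors' implementation.
from typing import List, NamedTuple, Tuple


class Example(NamedTuple):
    video: str     # path or id of the video
    question: str  # instruction/question posed to the LVLM
    label: str     # existing weak label (e.g., an action class)


def lvlm_generate(model, video: str, question: str) -> str:
    """Placeholder: prompt the LVLM to propose an answer with reasoning."""
    ...


def finetune(model, dataset: List[Tuple[str, str, str]]):
    """Placeholder: re-train the LVLM on the verified generations."""
    ...


def label_in_answer(answer: str, label: str) -> bool:
    # Label verification (simplified): keep an answer only if it
    # contains the video's original label, used as weak supervision.
    return label.lower() in answer.lower()


def video_star(model, examples: List[Example], num_cycles: int = 3):
    """Cycle between instruction generation and finetuning."""
    for _ in range(num_cycles):
        dataset = []
        for ex in examples:
            # Generation: the LVLM proposes an answer for each video.
            answer = lvlm_generate(model, ex.video, ex.question)
            # Filtering: discard answers missing the original label.
            if label_in_answer(answer, ex.label):
                dataset.append((ex.video, ex.question, answer))
        # Finetuning: re-train only on label-consistent generations.
        model = finetune(model, dataset)
    return model
```

Because the filter admits only generations consistent with the ground-truth labels, each cycle feeds the model better instruction data without requiring any human-written answers.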

