
Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision

July 8, 2024
Authors: Orr Zohar, Xiaohan Wang, Yonatan Bitton, Idan Szpektor, Serena Yeung-Levy
cs.AI

Abstract

The performance of Large Vision Language Models (LVLMs) depends on the size and quality of their training datasets. Existing video instruction tuning datasets lack diversity, as they are derived by prompting large language models with video captions to generate question-answer pairs, and are therefore mostly descriptive. Meanwhile, many labeled video datasets with diverse labels and supervision exist; however, we find that their integration into LVLMs is non-trivial. Herein, we present Video Self-Training with augmented Reasoning (Video-STaR), the first video self-training approach. Video-STaR enables the use of any labeled video dataset for video instruction tuning. In Video-STaR, an LVLM cycles between instruction generation and finetuning, which we show (I) improves general video understanding and (II) adapts LVLMs to novel downstream tasks with existing supervision. During generation, an LVLM is prompted to propose an answer. The answers are then filtered to keep only those that contain the original video labels, and the LVLM is re-trained on the resulting dataset. By training only on generated answers that contain the correct video labels, Video-STaR utilizes these existing video labels as weak supervision for video instruction tuning. Our results demonstrate that Video-STaR-enhanced LVLMs exhibit improved performance in (I) general video QA, where TempCompass performance improved by 10%, and (II) downstream tasks, where Video-STaR improved Kinetics700-QA accuracy by 20% and action quality assessment on FineDiving by 15%.
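
For concreteness, the generate-filter-finetune cycle described in the abstract can be sketched in a few lines of Python. This is a rough illustration only: the LVLM interface (`generate`, `finetune`), the `LabeledVideo` fields, and the substring-based label check are assumptions made for exposition, not the authors' implementation (which, per the method's name, also augments answers with reasoning).

```python
# Hypothetical sketch of the Video-STaR self-training loop.
# The LVLM interface and the substring label check are illustrative
# assumptions, not the paper's actual code.
from dataclasses import dataclass
from typing import Any

@dataclass
class LabeledVideo:
    frames: Any          # decoded video frames (representation unspecified)
    question: str        # instruction posed to the LVLM
    labels: list[str]    # ground-truth labels from the source dataset

def contains_all_labels(answer: str, labels: list[str]) -> bool:
    """Weak verification: keep an answer only if it mentions every label."""
    text = answer.lower()
    return all(label.lower() in text for label in labels)

def video_star_cycle(lvlm, videos: list[LabeledVideo], rounds: int = 3):
    for _ in range(rounds):
        verified = []
        for v in videos:
            # Generation: the LVLM proposes an answer for the video.
            answer = lvlm.generate(v.frames, v.question)
            # Filtering: the original labels act as weak supervision.
            if contains_all_labels(answer, v.labels):
                verified.append((v, answer))
        # Finetuning: re-train only on the verified self-generated answers.
        lvlm.finetune(verified)
    return lvlm
```

The key design point the sketch captures is that no new human annotation is needed: the filter reuses whatever labels the source dataset already provides to decide which self-generated answers are trustworthy enough to train on.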
