VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation

Abstract

Current large multimodal models (LMMs) face significant challenges in processing and comprehending long-duration or high-resolution videos, mainly due to the lack of high-quality datasets. To address this issue from a data-centric perspective, we propose VISTA, a simple yet effective Video Spatiotemporal Augmentation framework that synthesizes long-duration and high-resolution video instruction-following pairs from existing video-caption datasets. VISTA spatially and temporally combines videos to create new synthetic videos with extended durations and enhanced resolutions, and then produces question-answer pairs about these newly synthesized videos. Based on this paradigm, we develop seven video augmentation methods and curate VISTA-400K, a video instruction-following dataset aimed at enhancing long-duration and high-resolution video understanding. Finetuning various video LMMs on our data yields an average improvement of 3.3% across four challenging benchmarks for long-video understanding. Furthermore, we introduce HRVideoBench, the first comprehensive benchmark for high-resolution video understanding, on which our finetuned models achieve a 6.5% performance gain. These results highlight the effectiveness of our framework.
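As a rough illustration of what "spatially and temporally combining videos" could look like, the following minimal sketch joins clips end to end (longer duration) and tiles them into a grid (larger frames). It is not one of the seven augmentation methods described in the paper. The names `temporal_concat`, `spatial_grid`, and `solid_clip`, the list-based video representation, and the sample question-answer pair are all assumptions made for this example.

```python
from typing import List

Frame = List[List[int]]   # one grayscale frame: rows of pixel values
Video = List[Frame]       # a video: a sequence of frames


def temporal_concat(clips: List[Video]) -> Video:
    """Join clips end to end, producing a longer video."""
    return [frame for clip in clips for frame in clip]


def spatial_grid(clips: List[Video], cols: int) -> Video:
    """Tile same-sized clips into a grid, producing larger frames."""
    if len(clips) % cols != 0:
        raise ValueError("number of clips must fill the grid")
    # Use the shortest clip so every tile has a frame at each time step.
    n_frames = min(len(clip) for clip in clips)
    out: Video = []
    for t in range(n_frames):
        frame: Frame = []
        for start in range(0, len(clips), cols):
            row_clips = clips[start:start + cols]
            height = len(row_clips[0][t])
            for y in range(height):
                # Concatenate row y of each tile in this grid row.
                frame.append([px for clip in row_clips for px in clip[t][y]])
        out.append(frame)
    return out


def solid_clip(value: int, frames: int, h: int, w: int) -> Video:
    """Hypothetical helper: a clip whose pixels all equal `value`."""
    return [[[value] * w for _ in range(h)] for _ in range(frames)]


clips = [solid_clip(v, frames=2, h=2, w=3) for v in (1, 2, 3, 4)]
long_video = temporal_concat(clips)      # 8 frames, each 2x3
big_video = spatial_grid(clips, cols=2)  # 2 frames, each 4x6

# Hypothetical QA pair tied to the synthetic video (not from the paper).
qa = {"question": "Which clip appears in the top-right tile?",
      "answer": "clip 2"}
```

Because the position of each source clip in the synthetic video is known by construction, a question about that position can be answered without manual annotation, which is the kind of property a data-synthesis pipeline like this could rely on.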
