Goldfish: Vision-Language Understanding of Arbitrarily Long Videos

Abstract

Most current LLM-based models for video understanding can handle videos only a few minutes long. They struggle with lengthy videos because of two challenges: "noise and redundancy" and "memory and computation" constraints. In this paper, we present Goldfish, a methodology tailored for comprehending videos of arbitrary length. We also introduce the TVQA-long benchmark, designed to evaluate models' ability to understand long videos through questions about both visual and textual content. Goldfish addresses these challenges with an efficient retrieval mechanism that first gathers the top-k video clips relevant to the instruction and then provides the desired response. This retrieval design enables Goldfish to process arbitrarily long video sequences efficiently, facilitating its application to content such as movies or television series. To support the retrieval process, we developed MiniGPT4-Video, which generates detailed descriptions of the video clips. To address the scarcity of benchmarks for long-video evaluation, we adapted the TVQA short video benchmark for extended content analysis by aggregating questions from entire episodes, thereby shifting the evaluation from partial to full-episode comprehension. We attained a 41.78% accuracy rate on the TVQA-long benchmark, surpassing previous methods by 14.94%. MiniGPT4-Video also shows exceptional performance in short video comprehension, exceeding existing state-of-the-art methods by 3.23%, 2.03%, 16.5%, and 23.59% on the MSVD, MSRVTT, TGIF, and TVQA short video benchmarks, respectively. These results indicate that our models achieve significant improvements in both long and short video understanding. Our models and code are publicly available at https://vision-cair.github.io/Goldfish_website/
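
The retrieval step described in the abstract (gather the top-k clips relevant to the instruction, then answer from those clips) can be illustrated with a small sketch. This is not the authors' implementation: it assumes each clip description and the instruction have already been turned into fixed-length embedding vectors by some text encoder, and it assumes cosine similarity as the relevance score. The names `cosine_similarity` and `top_k_clips` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_clips(
    query_vec: Sequence[float],
    clip_vecs: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (clip_index, score) for the k clips most similar to the query.

    Assumption: clip_vecs[i] is an embedding of the text description of clip i,
    and query_vec is an embedding of the user's instruction.
    """
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(clip_vecs)]
    # Highest similarity first; ties keep their original clip order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings for four clips (made-up values for illustration only).
clip_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]
query_embedding = [1.0, 0.0, 0.0]

selected = top_k_clips(query_embedding, clip_embeddings, k=2)
selected_indices = [index for index, _ in selected]
print(selected_indices)
```

Because only the k selected clips are passed on to the answering model, the cost of the final step depends on k rather than on the total video length, which is the property the abstract attributes to the retrieval design.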
