TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models

Abstract

Understanding fine-grained temporal dynamics is crucial for multimodal video comprehension and generation. Because fine-grained temporal annotations are scarce, existing video benchmarks mostly resemble static image benchmarks and are ill-suited to evaluating temporal understanding. In this paper, we introduce TemporalBench, a new benchmark dedicated to evaluating fine-grained temporal understanding in videos. TemporalBench consists of ~10K video question-answer pairs derived from ~2K high-quality human annotations detailing the temporal dynamics in video clips. As a result, our benchmark provides a unique testbed for evaluating temporal understanding and reasoning abilities such as action frequency, motion magnitude, and event order. Moreover, it supports evaluation across tasks (video question answering and captioning), video lengths (short and long video understanding), and model types (multimodal video embedding models and text generation models). Results show that state-of-the-art models such as GPT-4o achieve only 38.5% question answering accuracy on TemporalBench, demonstrating a significant gap (~30%) between humans and AI in temporal understanding. Furthermore, we notice a critical pitfall of multi-choice QA: LLMs can detect the subtle changes in negative captions and use a centralized description as a cue for their prediction. To correct this bias, we propose Multiple Binary Accuracy (MBA). We hope that TemporalBench can foster research on improving models' temporal reasoning capabilities. Both the dataset and evaluation code will be made available.
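
The abstract names Multiple Binary Accuracy (MBA) but does not define it. The following is a minimal sketch under one assumed reading: each multi-choice question is split into several binary questions (the correct caption against one negative caption at a time), and the original question counts as correct only if every one of its binary questions is answered correctly. The function names and the `(question_id, is_correct)` input format are hypothetical and are not taken from the released evaluation code.

```python
from collections import defaultdict


def binary_accuracy(results):
    """Plain accuracy over individual binary questions.

    `results` is a list of (question_id, is_correct) pairs, one pair per
    binary question. This input format is an assumption for illustration.
    """
    results = list(results)
    if not results:
        return 0.0
    return sum(bool(ok) for _, ok in results) / len(results)


def multiple_binary_accuracy(results):
    """Assumed MBA: a question is credited only if ALL of its binary
    sub-questions were answered correctly."""
    per_question = defaultdict(list)
    for question_id, ok in results:
        per_question[question_id].append(bool(ok))
    if not per_question:
        return 0.0
    solved = sum(all(flags) for flags in per_question.values())
    return solved / len(per_question)
```

Under this reading, a model that answers most binary questions correctly can still score low on MBA if it misses at least one negative caption per question, which is why the grouped metric is stricter than per-question binary accuracy.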
