TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models

October 14, 2024
作者: Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, Yao Dou, Jaden Park, Jianfeng Gao, Yong Jae Lee, Jianwei Yang
cs.AI

Abstract

Understanding fine-grained temporal dynamics is crucial for multimodal video comprehension and generation. Due to the lack of fine-grained temporal annotations, existing video benchmarks mostly resemble static image benchmarks and are ill-suited to evaluating models for temporal understanding. In this paper, we introduce TemporalBench, a new benchmark dedicated to evaluating fine-grained temporal understanding in videos. TemporalBench consists of ~10K video question-answer pairs, derived from ~2K high-quality human annotations detailing the temporal dynamics in video clips. As a result, our benchmark provides a unique testbed for evaluating various temporal understanding and reasoning abilities such as action frequency, motion magnitude, and event order. Moreover, it supports evaluation across tasks (both video question answering and captioning), video lengths (both short and long video understanding), and model types (multimodal video embedding models as well as text generation models). Results show that state-of-the-art models like GPT-4o achieve only 38.5% question-answering accuracy on TemporalBench, demonstrating a significant gap (~30%) between humans and AI in temporal understanding. Furthermore, we notice a critical pitfall in multi-choice QA: LLMs can detect the subtle changes in negative captions and find a centralized description as a cue for their predictions. We propose Multiple Binary Accuracy (MBA) to correct this bias. We hope that TemporalBench can foster research on improving models' temporal reasoning capabilities. Both the dataset and the evaluation code will be made available.
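
The abstract names Multiple Binary Accuracy (MBA) but does not define it here. The sketch below illustrates one plausible reading, assuming each clip's multi-choice question is decomposed into binary positive-vs-negative caption comparisons and the clip counts as correct only if the model answers all of them correctly. The function name, record fields, and aggregation rule are illustrative assumptions, not the paper's released evaluation code.

```python
# Hedged sketch of an MBA-style metric (assumed reading of the abstract, not
# the authors' released implementation).
from collections import defaultdict
from typing import Iterable, Mapping


def multiple_binary_accuracy(results: Iterable[Mapping]) -> float:
    """Score clips under an all-or-nothing binary-comparison rule.

    Each record in `results` describes one binary comparison:
      - "clip_id": identifier of the video clip (hypothetical field name),
      - "correct": True if the model preferred the positive caption over the
                   negative caption in this comparison.
    A clip is counted as correct only when every one of its comparisons is correct.
    """
    per_clip = defaultdict(list)
    for record in results:
        per_clip[record["clip_id"]].append(bool(record["correct"]))

    if not per_clip:
        return 0.0
    clip_correct = [all(flags) for flags in per_clip.values()]
    return sum(clip_correct) / len(clip_correct)


if __name__ == "__main__":
    # Toy usage: clip "a" wins both comparisons, clip "b" misses one.
    demo = [
        {"clip_id": "a", "correct": True},
        {"clip_id": "a", "correct": True},
        {"clip_id": "b", "correct": True},
        {"clip_id": "b", "correct": False},
    ]
    print(multiple_binary_accuracy(demo))  # 0.5
```

Under this reading, a model cannot score well by latching onto a single "centralized" caption, because a mistake on any one binary pair zeroes out the clip.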
