
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models

October 14, 2024
作者: Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, Yao Dou, Jaden Park, Jianfeng Gao, Yong Jae Lee, Jianwei Yang
cs.AI

Abstract

Understanding fine-grained temporal dynamics is crucial for multimodal video comprehension and generation. Due to the lack of fine-grained temporal annotations, existing video benchmarks mostly resemble static image benchmarks and are ill-suited to evaluating models for temporal understanding. In this paper, we introduce TemporalBench, a new benchmark dedicated to evaluating fine-grained temporal understanding in videos. TemporalBench consists of ~10K video question-answer pairs, derived from ~2K high-quality human annotations detailing the temporal dynamics in video clips. As a result, our benchmark provides a unique testbed for evaluating various temporal understanding and reasoning abilities, such as action frequency, motion magnitude, and event order. Moreover, it enables evaluation across tasks (video question answering and captioning; short and long video understanding) and across model types, such as multimodal video embedding models and text generation models. Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench, demonstrating a significant gap (~30%) between humans and AI in temporal understanding. Furthermore, we identify a critical pitfall of multi-choice QA: LLMs can detect the subtle changes in negative captions and exploit a centralized description as a cue for their predictions. We therefore propose Multiple Binary Accuracy (MBA) to correct this bias. We hope that TemporalBench can foster research on improving models' temporal reasoning capabilities. Both the dataset and evaluation code will be made available.
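
The abstract describes Multiple Binary Accuracy (MBA) only at a high level, so the snippet below is a minimal Python sketch of one plausible reading: a question is counted as correct only when the model prefers the positive caption over every negative caption in pairwise binary comparisons, which removes the shortcut of spotting a single "centralized" odd option. The `score_caption` hook and the example field names are hypothetical illustrations, not the paper's released evaluation code.

```python
# Minimal sketch of Multiple Binary Accuracy (MBA) as described in the abstract.
# Assumption: a question is scored as correct only if the model's match score
# for the positive caption beats *every* negative caption individually.
# `score_caption(video, caption) -> float` is a hypothetical model hook.

from typing import Callable, Sequence


def multiple_binary_accuracy(
    examples: Sequence[dict],
    score_caption: Callable[[str, str], float],
) -> float:
    """Each example: {"video": path, "positive": str, "negatives": [str, ...]}."""
    correct = 0
    for ex in examples:
        pos_score = score_caption(ex["video"], ex["positive"])
        neg_scores = [score_caption(ex["video"], neg) for neg in ex["negatives"]]
        # All pairwise binary comparisons must be won for the question to count,
        # so a model cannot succeed by merely spotting the odd-one-out caption.
        if all(pos_score > neg for neg in neg_scores):
            correct += 1
    return correct / max(len(examples), 1)
```

In this reading, random guessing on a question with N negatives succeeds with probability (1/2)^N rather than 1/(N+1), which is why MBA penalizes cue-based shortcuts more heavily than standard multi-choice accuracy.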

