E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding

September 26, 2024
Authors: Ye Liu, Zongyang Ma, Zhongang Qi, Yang Wu, Ying Shan, Chang Wen Chen
cs.AI

Abstract

Recent advances in Video Large Language Models (Video-LLMs) have demonstrated their great potential in general-purpose video understanding. To verify the significance of these models, a number of benchmarks have been proposed to diagnose their capabilities in different scenarios. However, existing benchmarks merely evaluate models through video-level question-answering, lacking fine-grained event-level assessment and task diversity. To fill this gap, we introduce E.T. Bench (Event-Level & Time-Sensitive Video Understanding Benchmark), a large-scale and high-quality benchmark for open-ended event-level video understanding. Organized under a 3-level task taxonomy, E.T. Bench encompasses 7.3K samples under 12 tasks with 7K videos (251.4h total length) across 8 domains, providing comprehensive evaluations. We extensively evaluated 8 Image-LLMs and 12 Video-LLMs on our benchmark, and the results reveal that state-of-the-art models for coarse-level (video-level) understanding struggle to solve our fine-grained tasks, e.g., grounding events of interest within videos, largely due to short video context length, improper time representations, and a lack of multi-event training data. To address these issues, we further propose a strong baseline model, E.T. Chat, together with an instruction-tuning dataset, E.T. Instruct 164K, tailored for fine-grained event-level understanding. Our simple but effective solution demonstrates superior performance in multiple scenarios.
