VidText: Towards Comprehensive Evaluation for Video Text Understanding
May 28, 2025
作者: Zhoufaran Yang, Yan Shu, Zhifei Yang, Yan Zhang, Yu Li, Keyang Lu, Gangyan Zeng, Shaohui Liu, Yu Zhou, Nicu Sebe
cs.AI
Abstract
Visual texts embedded in videos carry rich semantic information, which is
crucial for both holistic video understanding and fine-grained reasoning about
local human actions. However, existing video understanding benchmarks largely
overlook textual information, while OCR-specific benchmarks are constrained to
static images, limiting their ability to capture the interaction between text
and dynamic visual contexts. To address this gap, we propose VidText, a new
benchmark designed for comprehensive and in-depth evaluation of video text
understanding. VidText offers the following key features: 1) It covers a wide
range of real-world scenarios and supports multilingual content, encompassing
diverse settings where video text naturally appears. 2) It introduces a
hierarchical evaluation framework with video-level, clip-level, and
instance-level tasks, enabling assessment of both global summarization and
local retrieval capabilities. 3) It further provides a set of paired
perception-reasoning tasks, ranging from visual text perception to cross-modal
reasoning between textual and visual information. Extensive experiments on 18
state-of-the-art Large Multimodal Models (LMMs) reveal that current models
struggle across most tasks, with significant room for improvement. Further
analysis highlights the impact of both model-intrinsic factors, such as input
resolution and OCR capability, and external factors, including the use of
auxiliary information and Chain-of-Thought reasoning strategies. We hope
VidText will fill the current gap in video understanding benchmarks and serve
as a foundation for future research on multimodal reasoning with video text in
dynamic environments.
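To make the hierarchical design in points 2) and 3) concrete, below is a minimal Python sketch of how such a benchmark could be organized and scored. The abstract does not specify VidText's actual data schema, task names, or metrics, so everything here (the Sample fields, the is_reasoning flag marking the perception and reasoning halves of a task pair, and the exact-match metric) is a hypothetical illustration, not the paper's implementation.

```python
# Hypothetical sketch of a hierarchical video-text benchmark harness.
# Field names, task names, and the exact-match metric are illustrative
# assumptions; they are not taken from the VidText paper.
from dataclasses import dataclass
from enum import Enum
from collections import defaultdict


class Granularity(Enum):
    VIDEO = "video-level"        # global summarization over the full video
    CLIP = "clip-level"          # reasoning over a temporal segment
    INSTANCE = "instance-level"  # localized retrieval of one text instance


@dataclass
class Sample:
    video_id: str
    granularity: Granularity
    task: str           # e.g. a perception task ("ocr") or its paired reasoning task
    is_reasoning: bool  # False = perception half of the pair, True = reasoning half
    question: str
    answer: str


def evaluate(model, samples):
    """Score predictions per (granularity, task, half) cell with plain
    exact-match accuracy -- a stand-in for task-specific metrics."""
    hits, totals = defaultdict(int), defaultdict(int)
    for s in samples:
        pred = model(s.video_id, s.question)  # model: any callable LMM wrapper
        key = (s.granularity.value, s.task,
               "reasoning" if s.is_reasoning else "perception")
        totals[key] += 1
        hits[key] += int(pred.strip().lower() == s.answer.strip().lower())
    return {k: hits[k] / totals[k] for k in totals}


if __name__ == "__main__":
    # Toy run with a dummy model that always answers "SALE".
    demo = [
        Sample("v1", Granularity.INSTANCE, "ocr", False,
               "What does the sign say?", "SALE"),
        Sample("v1", Granularity.INSTANCE, "ocr", True,
               "Why does the shopper stop?", "The sign advertises a sale."),
    ]
    print(evaluate(lambda vid, q: "SALE", demo))
```

Reporting scores per (granularity, task, perception/reasoning) cell, as sketched above, is one way to expose the gap the paper highlights between global summarization and local retrieval, and between perceiving text and reasoning over it.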