VidText: Towards Comprehensive Evaluation for Video Text Understanding
May 28, 2025
Authors: Zhoufaran Yang, Yan Shu, Zhifei Yang, Yan Zhang, Yu Li, Keyang Lu, Gangyan Zeng, Shaohui Liu, Yu Zhou, Nicu Sebe
cs.AI
Abstract
Visual texts embedded in videos carry rich semantic information, which is
crucial for both holistic video understanding and fine-grained reasoning about
local human actions. However, existing video understanding benchmarks largely
overlook textual information, while OCR-specific benchmarks are constrained to
static images, limiting their ability to capture the interaction between text
and dynamic visual contexts. To address this gap, we propose VidText, a new
benchmark designed for comprehensive and in-depth evaluation of video text
understanding. VidText offers the following key features: 1) It covers a wide
range of real-world scenarios and supports multilingual content, encompassing
diverse settings where video text naturally appears. 2) It introduces a
hierarchical evaluation framework with video-level, clip-level, and
instance-level tasks, enabling assessment of both global summarization and
local retrieval capabilities. 3) The benchmark also introduces a set of paired
perception-reasoning tasks, ranging from visual text perception to cross-modal
reasoning between textual and visual information. Extensive experiments on 18
state-of-the-art Large Multimodal Models (LMMs) reveal that current models
struggle across most tasks, with significant room for improvement. Further
analysis highlights the impact of both model-intrinsic factors, such as input
resolution and OCR capability, and external factors, including the use of
auxiliary information and Chain-of-Thought reasoning strategies. We hope
VidText will fill the current gap in video understanding benchmarks and serve
as a foundation for future research on multimodal reasoning with video text in
dynamic environments.