VidText: ビデオテキスト理解のための包括的評価に向けて

要旨

動画に埋め込まれた視覚的テキストは、豊富な意味情報を有しており、動画全体の理解と局所的な人間の行動に関する詳細な推論の両方において重要な役割を果たします。しかし、既存の動画理解ベンチマークはテキスト情報をほとんど考慮しておらず、OCRに特化したベンチマークは静止画像に限定されているため、テキストと動的な視覚的コンテキスト間の相互作用を捉える能力が制限されています。このギャップを埋めるため、我々はVidTextという新しいベンチマークを提案します。VidTextは、動画テキスト理解の包括的かつ深い評価を目的として設計されており、以下の特徴を備えています：1) 現実世界の多様なシナリオをカバーし、多言語コンテンツをサポートすることで、動画テキストが自然に現れる多様な設定を包含します。2) 動画レベル、クリップレベル、インスタンスレベルの階層的な評価フレームワークを導入し、全体の要約能力と局所的な検索能力の両方を評価可能にします。3) 視覚的テキストの知覚からテキストと視覚情報のクロスモーダル推論まで、一連のペアになった知覚推論タスクを導入します。18の最先端大規模マルチモーダルモデル（LMM）を用いた広範な実験により、現在のモデルはほとんどのタスクで苦戦しており、改善の余地が大きいことが明らかになりました。さらに、入力解像度やOCR能力などのモデル固有の要因と、補助情報の使用やChain-of-Thought推論戦略などの外部要因の影響を分析しました。VidTextが、動的環境における動画テキストを用いたマルチモーダル推論の未来の研究の基盤となり、現在の動画理解ベンチマークのギャップを埋めることを期待しています。

English

Visual texts embedded in videos carry rich semantic information, which is crucial for both holistic video understanding and fine-grained reasoning about local human actions. However, existing video understanding benchmarks largely overlook textual information, while OCR-specific benchmarks are constrained to static images, limiting their ability to capture the interaction between text and dynamic visual contexts. To address this gap, we propose VidText, a new benchmark designed for comprehensive and in-depth evaluation of video text understanding. VidText offers the following key features: 1) It covers a wide range of real-world scenarios and supports multilingual content, encompassing diverse settings where video text naturally appears. 2) It introduces a hierarchical evaluation framework with video-level, clip-level, and instance-level tasks, enabling assessment of both global summarization and local retrieval capabilities. 3) The benchmark also introduces a set of paired perception reasoning tasks, ranging from visual text perception to cross-modal reasoning between textual and visual information. Extensive experiments on 18 state-of-the-art Large Multimodal Models (LMMs) reveal that current models struggle across most tasks, with significant room for improvement. Further analysis highlights the impact of both model-intrinsic factors, such as input resolution and OCR capability, and external factors, including the use of auxiliary information and Chain-of-Thought reasoning strategies. We hope VidText will fill the current gap in video understanding benchmarks and serve as a foundation for future research on multimodal reasoning with video text in dynamic environments.

VidText: ビデオテキスト理解のための包括的評価に向けて

VidText: Towards Comprehensive Evaluation for Video Text Understanding

要旨

Support