VidText: 비디오 텍스트 이해를 위한 포괄적 평가 프레임워크

초록

비디오에 내재된 시각적 텍스트는 풍부한 의미 정보를 담고 있으며, 이는 비디오 전반적 이해와 지역적 인간 행동에 대한 세밀한 추론 모두에 있어 핵심적입니다. 그러나 기존의 비디오 이해 벤치마크는 텍스트 정보를 크게 간과하고 있으며, OCR 전용 벤치마크는 정적 이미지에 한정되어 있어 텍스트와 동적 시각적 맥락 간의 상호작용을 포착하는 데 한계가 있습니다. 이러한 격차를 해소하기 위해, 우리는 비디오 텍스트 이해를 포괄적이고 심층적으로 평가하기 위한 새로운 벤치마크인 VidText를 제안합니다. VidText는 다음과 같은 주요 특징을 제공합니다: 1) 다양한 실제 시나리오를 다루고 다국어 콘텐츠를 지원하여, 비디오 텍스트가 자연스럽게 등장하는 다양한 환경을 포괄합니다. 2) 비디오 수준, 클립 수준, 인스턴스 수준의 과제로 구성된 계층적 평가 프레임워크를 도입하여, 전역적 요약 능력과 지역적 검색 능력을 모두 평가할 수 있습니다. 3) 이 벤치마크는 시각적 텍스트 인식부터 텍스트와 시각 정보 간의 크로스모달 추론에 이르는 일련의 짝을 이룬 인지 추론 과제를 도입합니다. 18개의 최신 대형 멀티모달 모델(LMM)에 대한 광범위한 실험 결과, 현재 모델들은 대부분의 과제에서 어려움을 겪으며 개선의 여지가 크다는 것이 드러났습니다. 추가 분석에서는 입력 해상도와 OCR 능력과 같은 모델 내적 요인과, 보조 정보의 사용 및 사고의 연쇄(Chain-of-Thought) 추론 전략과 같은 외적 요인의 영향이 강조되었습니다. 우리는 VidText가 현재의 비디오 이해 벤치마크 격차를 메우고, 동적 환경에서의 비디오 텍스트를 활용한 멀티모달 추론 연구의 기반이 되기를 바랍니다.

English

Visual texts embedded in videos carry rich semantic information, which is crucial for both holistic video understanding and fine-grained reasoning about local human actions. However, existing video understanding benchmarks largely overlook textual information, while OCR-specific benchmarks are constrained to static images, limiting their ability to capture the interaction between text and dynamic visual contexts. To address this gap, we propose VidText, a new benchmark designed for comprehensive and in-depth evaluation of video text understanding. VidText offers the following key features: 1) It covers a wide range of real-world scenarios and supports multilingual content, encompassing diverse settings where video text naturally appears. 2) It introduces a hierarchical evaluation framework with video-level, clip-level, and instance-level tasks, enabling assessment of both global summarization and local retrieval capabilities. 3) The benchmark also introduces a set of paired perception reasoning tasks, ranging from visual text perception to cross-modal reasoning between textual and visual information. Extensive experiments on 18 state-of-the-art Large Multimodal Models (LMMs) reveal that current models struggle across most tasks, with significant room for improvement. Further analysis highlights the impact of both model-intrinsic factors, such as input resolution and OCR capability, and external factors, including the use of auxiliary information and Chain-of-Thought reasoning strategies. We hope VidText will fill the current gap in video understanding benchmarks and serve as a foundation for future research on multimodal reasoning with video text in dynamic environments.

VidText: 비디오 텍스트 이해를 위한 포괄적 평가 프레임워크

VidText: Towards Comprehensive Evaluation for Video Text Understanding

초록

Support