MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios

Abstract

Multimodal Large Language Models (MLLMs) have achieved considerable accuracy in Optical Character Recognition (OCR) on static images. However, their efficacy in video OCR is significantly diminished by factors such as motion blur, temporal variations, and visual effects inherent in video content. To provide clearer guidance for training practical MLLMs, we introduce the MME-VideoOCR benchmark, which covers a comprehensive range of video OCR application scenarios. MME-VideoOCR features 10 task categories comprising 25 individual tasks and spans 44 diverse scenarios. These tasks extend beyond text recognition to incorporate deeper comprehension of, and reasoning about, textual content within videos. The benchmark consists of 1,464 videos with varying resolutions, aspect ratios, and durations, along with 2,000 meticulously curated, manually annotated question-answer pairs. We evaluate 18 state-of-the-art MLLMs on MME-VideoOCR and find that even the best-performing model (Gemini-2.5 Pro) achieves an accuracy of only 73.7%. Fine-grained analysis indicates that while existing MLLMs perform strongly on tasks where the relevant text is contained within a single frame or a few frames, they show limited capability on tasks that demand holistic video comprehension. These limitations are especially evident in scenarios that require spatio-temporal reasoning, cross-frame information integration, or resistance to language prior bias. Our findings also highlight the importance of high-resolution visual input and sufficient temporal coverage for reliable OCR in dynamic video scenarios.

