MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

Abstract

Multimodal Large Language Models (MLLMs) demonstrate the emerging abilities of "world models": interpreting and reasoning about complex real-world dynamics. To assess these abilities, we posit that videos are the ideal medium, as they encapsulate rich representations of real-world dynamics and causalities. To this end, we introduce MMWorld, a new benchmark for multi-discipline, multi-faceted multimodal video understanding. MMWorld distinguishes itself from previous video understanding benchmarks with two unique advantages: (1) multi-discipline, covering various disciplines that often require domain expertise for comprehensive understanding; (2) multi-faceted reasoning, including explanation, counterfactual thinking, future prediction, etc. MMWorld consists of a human-annotated dataset to evaluate MLLMs with questions about whole videos and a synthetic dataset to analyze MLLMs within a single modality of perception. Together, MMWorld encompasses 1,910 videos across seven broad disciplines and 69 subdisciplines, complete with 6,627 question-answer pairs and associated captions. The evaluation covers 2 proprietary and 10 open-source MLLMs, all of which struggle on MMWorld (e.g., GPT-4V performs best with only 52.3% accuracy), showing large room for improvement. Further ablation studies reveal other interesting findings, such as models having different skill sets from humans. We hope MMWorld can serve as an essential step towards world model evaluation in videos.
