Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Abstract

The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the reasoning ability of large language models (LLMs) and vision language models (VLMs). However, these paradigms have inherent limitations: (1) images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) text and vision are treated as separate modalities, which hinders unified multimodal understanding and generation. To overcome these limitations, we introduce "Thinking with Video", a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K and MMMU). Our evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even surpasses them on several tasks, such as Eyeballing Games. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH and 75.53% accuracy on MMMU. Furthermore, we systematically analyse the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2's performance. In summary, our findings suggest that video generation models are potential unified multimodal understanding and generation models, and they position "Thinking with Video" as a unified multimodal reasoning paradigm.
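As a rough illustration of the self-consistency idea mentioned in the abstract, the sketch below aggregates several sampled answers by majority vote. The abstract does not say how answers are extracted from generated videos or how they are combined, so this assumes the common majority-vote form of self-consistency. The function name `majority_vote` and the sample answers are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(answers: Sequence[Optional[Hashable]]) -> Optional[Hashable]:
    """Return the most frequent non-None answer, or None if there is none.

    Each element stands for the final answer extracted from one independently
    sampled generation (hypothetical; the extraction step is not shown).
    Ties are broken in favour of the answer that appeared first.
    """
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    # Counter.most_common keeps first-seen order among equal counts.
    return Counter(valid).most_common(1)[0][0]


# Hypothetical answers read off five sampled videos for one math problem;
# None marks a sample where no answer could be extracted.
samples = ["42", "42", "41", None, "42"]
print(majority_vote(samples))  # prints 42
```

Dropping samples with no extractable answer before voting is a design choice made for this sketch; counting them as wrong answers would be an equally valid convention.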
