VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

Abstract

The advancement of Large Vision Language Models (LVLMs) has significantly improved multimodal understanding, yet challenges remain in video reasoning tasks due to the scarcity of high-quality, large-scale datasets. Existing video question-answering (VideoQA) datasets often rely either on costly manual annotations with insufficient granularity or on automatic construction methods with redundant frame-by-frame analysis, which limits their scalability and their effectiveness for complex reasoning. To address these challenges, we introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence, along with multimodal annotations of intermediate reasoning steps. Our construction pipeline employs a semantic-aware method to reduce redundancy and then generates QA pairs using GPT-4o. We further develop video Chain-of-Thought (CoT) annotations to enrich reasoning processes, guiding GPT-4o in extracting logical relationships from QA pairs and video content. To exploit the potential of high-quality VideoQA pairs, we propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM. This framework adaptively selects core frames and performs CoT reasoning using multimodal evidence. Evaluated on our proposed benchmark with 14 tasks against 9 popular LVLMs, our method outperforms existing baselines on most tasks, demonstrating superior video reasoning capabilities. Our code and dataset will be released at: https://github.com/hshjerry/VideoEspresso
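The redundancy-reduction step can be illustrated with a small sketch. The abstract does not specify how the semantic-aware method works, so everything here is an assumption for illustration only: each frame is represented by a hypothetical embedding vector, and a frame is kept only when its cosine similarity to the most recently kept frame falls below a hypothetical threshold. The function names and the default of 0.9 are not taken from the paper.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_redundant_frames(
    embeddings: Sequence[Sequence[float]],
    threshold: float = 0.9,  # hypothetical value, not from the paper
) -> List[int]:
    """Return indices of frames that differ enough from the last kept frame.

    Assumption: `embeddings` holds one semantic vector per frame, in
    temporal order, produced by some encoder that is not specified here.
    """
    kept: List[int] = []
    for index, embedding in enumerate(embeddings):
        # Always keep the first frame; afterwards compare against the
        # most recently kept frame rather than the immediate predecessor,
        # so slow drifts eventually produce a new kept frame.
        if not kept or cosine_similarity(embeddings[kept[-1]], embedding) < threshold:
            kept.append(index)
    return kept


if __name__ == "__main__":
    # Toy 2-D "embeddings": frames 1 and 3 nearly repeat their predecessors.
    frames = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
    print(drop_redundant_frames(frames))
```

Comparing against the last kept frame, rather than the previous frame, is one possible design choice: it avoids keeping a long run of frames that each differ only slightly from their neighbor. Whether VideoEspresso does this is not stated in the abstract.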
