VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models

Abstract

Large Video Models (LVMs) built upon Large Language Models (LLMs) have shown promise in video understanding, but they often suffer from misalignment with human intuition and from video hallucination. To address these challenges, we introduce VistaDPO, a novel framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization. VistaDPO enhances text-video preference alignment across three hierarchical levels: i) Instance Level, aligning overall video content with responses; ii) Temporal Level, aligning video temporal semantics with event descriptions; and iii) Perceptive Level, aligning spatial objects with language tokens. Given the lack of datasets for fine-grained video-language preference alignment, we construct VistaDPO-7k, a dataset of 7.2K QA pairs annotated with chosen and rejected responses, along with spatial-temporal grounding information such as timestamps, keyframes, and bounding boxes. Extensive experiments on video hallucination, video QA, and captioning benchmarks demonstrate that VistaDPO significantly improves the performance of existing LVMs, effectively mitigating video-language misalignment and hallucination. The code and data are available at https://github.com/HaroldChen19/VistaDPO.
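
To make the three-level idea concrete, the sketch below applies the standard DPO objective to one chosen/rejected comparison per level and sums the results. It is an illustration only, not the VistaDPO implementation: the abstract does not say how the levels are combined, so the weighted sum, the names `PreferencePair`, `dpo_loss`, and `hierarchical_loss`, and the values of `beta` and the weights are all assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """Summed log-probabilities for one chosen/rejected comparison."""
    policy_chosen: float
    policy_rejected: float
    ref_chosen: float
    ref_rejected: float


def dpo_loss(pair: PreferencePair, beta: float = 0.1) -> float:
    """Standard DPO loss: -log(sigmoid(beta * margin))."""
    # Margin: how much more the policy favors the chosen sample over the
    # rejected one, relative to the frozen reference model.
    margin = (pair.policy_chosen - pair.ref_chosen) - (
        pair.policy_rejected - pair.ref_rejected
    )
    x = beta * margin
    # Numerically stable form of log(1 + exp(-x)).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def hierarchical_loss(levels: dict, weights: dict, beta: float = 0.1) -> float:
    """Hypothetical combination: weighted sum of per-level DPO losses."""
    return sum(weights[name] * dpo_loss(pair, beta) for name, pair in levels.items())


# Illustrative numbers only; one comparison per level named in the abstract.
levels = {
    "instance": PreferencePair(-12.0, -15.0, -13.0, -14.0),
    "temporal": PreferencePair(-8.0, -9.5, -8.5, -9.0),
    "perceptive": PreferencePair(-4.0, -4.2, -4.1, -4.1),
}
weights = {"instance": 1.0, "temporal": 1.0, "perceptive": 1.0}
total = hierarchical_loss(levels, weights)
```

A pair whose margin is zero contributes `log(2)` to the total, and the contribution shrinks as the policy favors the chosen sample more strongly than the reference model does.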