VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models

Abstract

Large Video Models (LVMs) built upon Large Language Models (LLMs) have shown promise in video understanding, but they often suffer from misalignment with human intuition and from video hallucination. To address these challenges, we introduce VistaDPO, a novel framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization. VistaDPO enhances text-video preference alignment across three hierarchical levels: i) Instance Level, aligning overall video content with responses; ii) Temporal Level, aligning video temporal semantics with event descriptions; and iii) Perceptive Level, aligning spatial objects with language tokens. Given the lack of datasets for fine-grained video-language preference alignment, we construct VistaDPO-7k, a dataset of 7.2K QA pairs annotated with chosen and rejected responses, along with spatial-temporal grounding information such as timestamps, keyframes, and bounding boxes. Extensive experiments on video hallucination, video QA, and captioning benchmarks demonstrate that VistaDPO significantly improves the performance of existing LVMs, effectively mitigating video-language misalignment and hallucination. The code and data are available at https://github.com/HaroldChen19/VistaDPO.
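To make the ingredients named in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. `dpo_loss` is the standard DPO objective on sequence log-probabilities. Everything else is an assumption made for illustration: the `PreferenceSample` field names are hypothetical and not the released VistaDPO-7k schema, and `hierarchical_loss` simply assumes that the instance, temporal, and perceptive levels each contribute a DPO-style term combined by a weighted sum, which the abstract does not specify.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field


@dataclass
class PreferenceSample:
    """Hypothetical record layout; field names are illustrative only."""
    question: str
    chosen: str      # preferred response
    rejected: str    # dispreferred response
    timestamps: list[tuple[float, float]] = field(default_factory=list)  # (start_s, end_s)
    keyframes: list[int] = field(default_factory=list)                   # frame indices
    boxes: list[tuple[float, float, float, float]] = field(default_factory=list)  # (x1, y1, x2, y2)


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Arguments are log-probabilities of the chosen/rejected responses under
    the policy (pi_*) and the frozen reference model (ref_*).
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)) == softplus(-margin)
    return softplus(-margin)


def hierarchical_loss(level_logps: dict[str, tuple[float, float, float, float]],
                      weights: dict[str, float],
                      beta: float = 0.1) -> float:
    """Assumed combination: weighted sum of one DPO term per level."""
    return sum(weights[name] * dpo_loss(*logps, beta=beta)
               for name, logps in level_logps.items())
```

A usage sketch with made-up log-probabilities and equal weights (both hypothetical):

```python
levels = {
    "instance":   (-12.0, -15.0, -13.0, -14.0),
    "temporal":   (-8.0, -9.5, -8.5, -9.0),
    "perceptive": (-3.0, -4.0, -3.2, -3.9),
}
total = hierarchical_loss(levels, {"instance": 1.0, "temporal": 1.0, "perceptive": 1.0})
```

Each term drops below log 2 once the policy favors the chosen response more strongly than the reference model does.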