大規模ビデオ言語モデルの自己整合化：洗練された正則化選好最適化によるアプローチ

要旨

大規模ビデオ言語モデル（LVLM）の最近の進展にもかかわらず、これらのモデルは依然として細かい時間的理解に苦戦し、幻覚を起こし、単純なビデオ質問応答タスクでさえ簡単なミスを犯すことが多く、これらは実世界のアプリケーションにおける安全で信頼性の高い展開に重大な課題を提起しています。これらの制限に対処するため、我々はLVLMが自身の誤りから学習することを可能にする自己整合フレームワークを提案します。提案するフレームワークはまず、好ましい応答と好ましくない応答のペアのトレーニングセットを取得します。ここで、好ましくない応答は、不十分な時空間的理解、共起する概念間の誤った相関、視覚モダリティを無視した言語的キューへの過度の依存などによって頻繁に発生する一般的なエラーパターンを組み込んで生成されます。構築された好ましい応答と好ましくない応答のペアを用いてLVLMの自己整合を促進するために、我々はRefined Regularized Preference Optimization（RRPO）を導入します。これは、Direct Preference Optimization（DPO）の限界を解決するために、サブシーケンスレベルの洗練された報酬とトークンレベルのKL正則化を利用する新しい選好最適化手法です。我々は、RRPOがDPOと比較してより正確な整合とより安定したトレーニングを達成することを実証します。我々の実験と分析は、ビデオ幻覚、短編および長編ビデオの理解、細かい時間的推論を含む多様なビデオタスクにおけるアプローチの有効性を検証します。

English

Despite recent advances in Large Video Language Models (LVLMs), they still struggle with fine-grained temporal understanding, hallucinate, and often make simple mistakes on even simple video question-answering tasks, all of which pose significant challenges to their safe and reliable deployment in real-world applications. To address these limitations, we propose a self-alignment framework that enables LVLMs to learn from their own errors. Our proposed framework first obtains a training set of preferred and non-preferred response pairs, where non-preferred responses are generated by incorporating common error patterns that often occur due to inadequate spatio-temporal understanding, spurious correlations between co-occurring concepts, and over-reliance on linguistic cues while neglecting the vision modality, among others. To facilitate self-alignment of LVLMs with the constructed preferred and non-preferred response pairs, we introduce Refined Regularized Preference Optimization (RRPO), a novel preference optimization method that utilizes sub-sequence-level refined rewards and token-wise KL regularization to address the limitations of Direct Preference Optimization (DPO). We demonstrate that RRPO achieves more precise alignment and more stable training compared to DPO. Our experiments and analysis validate the effectiveness of our approach across diverse video tasks, including video hallucination, short- and long-video understanding, and fine-grained temporal reasoning.

大規模ビデオ言語モデルの自己整合化：洗練された正則化選好最適化によるアプローチ

Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization

要旨

Support