Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward
April 1, 2024
Authors: Ruohong Zhang, Liangke Gui, Zhiqing Sun, Yihao Feng, Keyang Xu, Yuanhan Zhang, Di Fu, Chunyuan Li, Alexander Hauptmann, Yonatan Bisk, Yiming Yang
cs.AI
Abstract
Preference modeling techniques, such as direct preference optimization (DPO),
have proven effective in enhancing the generalization abilities of large
language models (LLMs). However, in tasks involving video instruction-following,
providing informative feedback, especially for detecting hallucinations in
generated responses, remains a significant challenge. Previous studies have
explored using large multimodal models (LMMs) as reward models to guide
preference modeling, but their ability to accurately assess the factuality of
generated responses against the corresponding videos has not been conclusively
established. This paper introduces a novel framework that utilizes detailed
video captions as a proxy for video content, enabling language models to
incorporate this information as supporting evidence for scoring video Question
Answering (QA) predictions. Our approach demonstrates robust alignment with the
reward mechanism of the OpenAI GPT-4V model, which directly takes video frames
as input. Furthermore, we show that applying this tailored reward through DPO
significantly improves the performance of video LMMs on video QA tasks.