OViP: Online Vision-Language Preference Learning
May 21, 2025
Authors: Shujun Liu, Siyuan Wang, Zejun Li, Jianxiang Wang, Cheng Zeng, Zhongyu Wei
cs.AI
Abstract
Large vision-language models (LVLMs) remain vulnerable to hallucination, often generating content misaligned with visual inputs. While recent approaches advance multi-modal Direct Preference Optimization (DPO) to mitigate hallucination, they typically rely on predefined or randomly edited negative samples that fail to reflect actual model errors, limiting training efficacy. In this work, we propose an Online Vision-language Preference Learning (OViP) framework that dynamically constructs contrastive training data based on the model's own hallucinated outputs. By identifying semantic differences between sampled response pairs and synthesizing negative images using a diffusion model, OViP generates more relevant supervision signals in real time. This failure-driven training enables adaptive alignment of both textual and visual preferences. Moreover, we refine existing evaluation protocols to better capture the trade-off between hallucination suppression and expressiveness. Experiments on hallucination and general benchmarks demonstrate that OViP effectively reduces hallucinations while preserving core multi-modal capabilities.
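The online data-construction loop described in the abstract can be outlined roughly as below. This is a minimal sketch under stated assumptions, not the authors' implementation: `sample_responses`, `hallucination_score`, `extract_hallucinated_content`, and `synthesize_negative_image` are hypothetical placeholders for the policy LVLM's sampler, the response-scoring step, the semantic-difference extraction between sampled responses, and the diffusion-based negative-image synthesis, respectively.

```python
# Sketch of OViP-style online contrastive-data construction (hypothetical interfaces).
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PreferencePair:
    image: str       # original (positive) image
    neg_image: str   # synthesized negative image depicting the hallucinated content
    prompt: str
    chosen: str      # sampled response with fewer hallucinations
    rejected: str    # sampled response with more hallucinations


def build_online_pairs(
    batch: List[Tuple[str, str]],                              # (image, prompt)
    sample_responses: Callable[[str, str, int], List[str]],    # policy LVLM sampler
    hallucination_score: Callable[[str, str], float],          # higher = more hallucinated
    extract_hallucinated_content: Callable[[str, str], str],   # semantic diff of responses
    synthesize_negative_image: Callable[[str, str], str],      # diffusion-based editing
    n_samples: int = 4,
) -> List[PreferencePair]:
    """Build contrastive pairs from the model's own sampled outputs."""
    pairs: List[PreferencePair] = []
    for image, prompt in batch:
        # 1) Sample several responses from the current policy.
        responses = sample_responses(image, prompt, n_samples)
        scored = sorted(
            ((hallucination_score(image, r), r) for r in responses),
            key=lambda t: t[0],
        )
        (low, chosen), (high, rejected) = scored[0], scored[-1]
        if high <= low:
            continue  # samples do not differ enough to form a useful contrast

        # 2) Identify the hallucinated content that separates the two responses.
        diff = extract_hallucinated_content(chosen, rejected)

        # 3) Synthesize a negative image that does depict the hallucinated content,
        #    giving an image-side contrast alongside the text-side one.
        neg_image = synthesize_negative_image(image, diff)
        pairs.append(PreferencePair(image, neg_image, prompt, chosen, rejected))
    return pairs
```

Under this reading, the resulting pairs would feed a standard multi-modal DPO-style update, with the original versus synthesized images providing the visual contrast and the chosen versus rejected responses providing the textual one.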