
OViP: Online Vision-Language Preference Learning

May 21, 2025
Authors: Shujun Liu, Siyuan Wang, Zejun Li, Jianxiang Wang, Cheng Zeng, Zhongyu Wei
cs.AI

Abstract
Large vision-language models (LVLMs) remain vulnerable to hallucination, often generating content misaligned with visual inputs. While recent approaches advance multi-modal Direct Preference Optimization (DPO) to mitigate hallucination, they typically rely on predefined or randomly edited negative samples that fail to reflect actual model errors, limiting training efficacy. In this work, we propose an Online Vision-language Preference Learning (OViP) framework that dynamically constructs contrastive training data based on the model's own hallucinated outputs. By identifying semantic differences between sampled response pairs and synthesizing negative images using a diffusion model, OViP generates more relevant supervision signals in real time. This failure-driven training enables adaptive alignment of both textual and visual preferences. Moreover, we refine existing evaluation protocols to better capture the trade-off between hallucination suppression and expressiveness. Experiments on hallucination and general benchmarks demonstrate that OViP effectively reduces hallucinations while preserving core multi-modal capabilities.
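For context on the optimization objective the abstract refers to: Direct Preference Optimization trains the policy to prefer the chosen response over the rejected one via an implicit reward defined by log-probability ratios against a frozen reference model. The sketch below is a minimal, illustrative implementation of the standard DPO loss on a single preference pair; it is not the authors' code, and the function name and scalar inputs (summed response log-probabilities) are assumptions for illustration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are summed token log-probabilities of each full response
    under the current policy and under the frozen reference model.
    Illustrative sketch only, not the OViP implementation.
    """
    # Implicit rewards: beta-scaled log-ratio of policy to reference.
    r_chosen = beta * (logp_chosen - ref_logp_chosen)
    r_rejected = beta * (logp_rejected - ref_logp_rejected)
    # Loss is the negative log-sigmoid of the reward margin,
    # pushing the policy to widen the chosen-over-rejected gap.
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

In an online scheme such as the one described above, the (chosen, rejected) pairs would be constructed on the fly from the model's own sampled outputs rather than drawn from a fixed, pre-edited dataset.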

