Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images

Abstract

Recent studies have shown that Large Vision-Language Models (VLMs) tend to neglect image content and over-rely on language-model priors, resulting in errors in visually grounded tasks and hallucinations. We hypothesize that this issue arises because existing VLMs are not explicitly trained to generate text that is accurately grounded in fine-grained image details. To enhance visual feedback during VLM training, we propose S-VCO (Symmetrical Visual Contrastive Optimization), a novel finetuning objective that steers the model toward capturing important visual details and aligning them with the corresponding text tokens. To further facilitate this detailed alignment, we introduce MVC, a paired image-text dataset built by automatically filtering and augmenting visual counterfactual data, which challenges the model with hard contrastive cases involving Minimal Visual Contrasts. Experiments show that our method consistently improves VLM performance across diverse benchmarks covering various abilities and domains, achieving up to a 22% reduction in hallucinations and significant gains in vision-centric and general tasks. Notably, these improvements become increasingly pronounced in benchmarks with higher visual dependency. In short, S-VCO significantly improves VLMs' performance on visually dependent tasks while retaining or even improving their general abilities. We open-source our code at https://s-vco.github.io/.
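To make the idea of a symmetric contrast over image pairs concrete, the following is a minimal sketch only. The abstract does not give the S-VCO objective, so this is not the paper's loss: it assumes a simple logistic margin between the log-likelihood of a text under its matching image and under the minimally different contrast image, applied in both directions. The function names, the `beta` scale, and the `logp` dictionary layout are all hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def one_sided_loss(logp_match: float, logp_contrast: float, beta: float = 1.0) -> float:
    # Assumed form: penalize the model when a text is not more likely
    # under its matching image than under the contrast image.
    return -log_sigmoid(beta * (logp_match - logp_contrast))


def symmetric_contrastive_loss(logp: dict, beta: float = 1.0) -> float:
    # logp[(image, text)] holds log p(text | image) for a pair of
    # minimally different images "a" and "b" and their texts "a" and "b".
    # Both pairs are treated the same way, which is what makes the
    # sketch symmetric: neither image is only ever a negative example.
    loss_a = one_sided_loss(logp[("a", "a")], logp[("b", "a")], beta)
    loss_b = one_sided_loss(logp[("b", "b")], logp[("a", "b")], beta)
    return 0.5 * (loss_a + loss_b)


# A model that ignores the image assigns each text the same likelihood
# under either image; a grounded model prefers the matching image.
blind = {("a", "a"): -5.0, ("b", "a"): -5.0, ("b", "b"): -6.0, ("a", "b"): -6.0}
grounded = {("a", "a"): -2.0, ("b", "a"): -8.0, ("b", "b"): -2.5, ("a", "b"): -9.0}
print(symmetric_contrastive_loss(blind), symmetric_contrastive_loss(grounded))
```

Under this assumed form, a model that relies purely on language priors gets no margin from either image and sits at a loss of log 2, while a model that uses the visual contrast drives the loss toward zero.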
