Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images

Abstract

Recent studies have shown that large Vision-Language Models (VLMs) tend to neglect image content and over-rely on language-model priors, which leads to errors on visually grounded tasks and to hallucinations. We hypothesize that this issue arises because existing VLMs are not explicitly trained to generate text that is accurately grounded in fine-grained image details. To strengthen visual feedback during VLM training, we propose S-VCO (Symmetrical Visual Contrastive Optimization), a novel finetuning objective that steers the model toward capturing important visual details and aligning them with the corresponding text tokens. To further facilitate this detailed alignment, we introduce MVC, a paired image-text dataset built by automatically filtering and augmenting visual counterfactual data, which challenges the model with hard contrastive cases involving Minimal Visual Contrasts. Experiments show that our method consistently improves VLM performance across diverse benchmarks covering various abilities and domains, achieving up to a 22% reduction in hallucinations and significant gains on vision-centric and general tasks. Notably, these improvements become increasingly pronounced on benchmarks with higher visual dependency. In short, S-VCO significantly enhances VLMs' performance on visually dependent tasks while retaining or even improving the model's general abilities. Our code is open-sourced at https://s-vco.github.io/.
