Silkie: Preference Distillation for Large Visual Language Models

Abstract

This paper explores preference distillation for large vision language models (LVLMs), improving their ability to generate helpful and faithful responses anchored in the visual context. We first build a vision-language feedback (VLFeedback) dataset using AI annotation. Specifically, responses are generated by models sampled from 12 LVLMs, conditioned on multi-modal instructions sourced from various datasets. We adopt GPT-4V to assess the generated outputs with respect to helpfulness, visual faithfulness, and ethical considerations. The preference supervision is then distilled into Qwen-VL-Chat through the direct preference optimization (DPO) method. The resulting model, Silkie, achieves 6.9% and 9.5% relative improvements on the MME benchmark in perception and cognition capabilities, respectively. Silkie also demonstrates reduced hallucination by setting a new state-of-the-art score of 3.02 on the MMHal-Bench benchmark. Further analysis shows that DPO with our VLFeedback dataset mainly boosts the fine-grained perception and complex cognition abilities of LVLMs, leading to more comprehensive improvements than human-annotated preference datasets.
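As a rough illustration of the DPO step mentioned in the abstract, the sketch below computes the standard DPO loss for a single preference pair from summed log-probabilities. It is a minimal sketch of the general DPO objective, not the authors' training code: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions chosen for illustration, and the abstract does not state the hyperparameters actually used.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed token log-probability of a response
    under the trainable policy or the frozen reference model.
    beta=0.1 is an illustrative default, not a value from the paper.
    """
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled margin between the preferred and dispreferred response.
    margin = beta * (chosen_logratio - rejected_logratio)

    # Loss is -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # The policy assigns relatively more probability to the chosen
    # response than the reference does, so the loss falls below log(2).
    print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log 2; the loss decreases as the policy shifts probability toward the preferred response relative to the reference.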
