Silkie: Preference Distillation for Large Visual Language Models

December 17, 2023
作者: Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, Lingpeng Kong
cs.AI

Abstract

This paper explores preference distillation for large vision language models (LVLMs), improving their ability to generate helpful and faithful responses anchored in the visual context. We first build a vision-language feedback (VLFeedback) dataset utilizing AI annotation. Specifically, responses are generated by models sampled from 12 LVLMs, conditioned on multi-modal instructions sourced from various datasets. We adopt GPT-4V to assess the generated outputs regarding helpfulness, visual faithfulness, and ethical considerations. Furthermore, the preference supervision is distilled into Qwen-VL-Chat through the direct preference optimization (DPO) method. The resulting model, Silkie, achieves 6.9% and 9.5% relative improvement on the MME benchmark regarding the perception and cognition capabilities, respectively. Silkie also demonstrates reduced hallucination by setting a new state-of-the-art score of 3.02 on the MMHal-Bench benchmark. Further analysis shows that DPO with our VLFeedback dataset mainly boosts the fine-grained perception and complex cognition abilities of LVLMs, leading to more comprehensive improvements compared to human-annotated preference datasets.
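
For readers unfamiliar with DPO, the sketch below illustrates the standard direct preference optimization objective used to distill pairwise preferences such as those in VLFeedback. It is a minimal, generic example under the usual DPO formulation; the function and variable names are assumptions for illustration, not code released with the paper.

```python
# Minimal sketch of the DPO objective (illustrative; not the paper's code).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct preference optimization loss.

    Each tensor holds per-example sequence log-probabilities of the
    preferred (chosen) or dispreferred (rejected) response under the
    policy being trained or the frozen reference model.
    """
    # Implicit rewards: log-ratio of policy to reference, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In this setting, the "chosen" and "rejected" responses would correspond to the GPT-4V-preferred and dispreferred LVLM outputs in VLFeedback, with Qwen-VL-Chat serving as both the initial policy and the frozen reference model.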