Silkie: Preference Distillation for Large Visual Language Models

December 17, 2023
作者: Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, Lingpeng Kong
cs.AI

Abstract

This paper explores preference distillation for large vision language models (LVLMs), improving their ability to generate helpful and faithful responses anchored in the visual context. We first build a vision-language feedback (VLFeedback) dataset utilizing AI annotation. Specifically, responses are generated by models sampled from 12 LVLMs, conditioned on multi-modal instructions sourced from various datasets. We adopt GPT-4V to assess the generated outputs regarding helpfulness, visual faithfulness, and ethical considerations. Furthermore, the preference supervision is distilled into Qwen-VL-Chat through the direct preference optimization (DPO) method. The resulting model, Silkie, achieves 6.9% and 9.5% relative improvement on the MME benchmark regarding the perception and cognition capabilities, respectively. Silkie also demonstrates reduced hallucination by setting a new state-of-the-art score of 3.02 on the MMHal-Bench benchmark. Further analysis shows that DPO with our VLFeedback dataset mainly boosts the fine-grained perception and complex cognition abilities of LVLMs, leading to more comprehensive improvements compared to human-annotated preference datasets.
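For readers unfamiliar with the distillation step, the minimal sketch below shows how pairwise preferences derived from GPT-4V scores could be turned into a DPO training signal. This is not the authors' released implementation: the pair-selection heuristic, the function names, and the beta value are illustrative assumptions; only the DPO objective itself follows the standard formulation.

```python
# Minimal sketch (not the authors' code): distilling AI preference
# annotations into a policy model with the DPO objective.
# The pair-selection heuristic and beta value are assumptions for illustration.
import torch
import torch.nn.functional as F


def build_preference_pair(responses, scores):
    """Pick the highest- and lowest-scored responses as (chosen, rejected).

    `scores` is assumed to be a per-response aggregate of the GPT-4V ratings
    for helpfulness, visual faithfulness, and ethical considerations.
    """
    order = sorted(range(len(responses)), key=lambda i: scores[i])
    return responses[order[-1]], responses[order[0]]


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over summed token log-probabilities of full responses.

    `policy_*` come from the model being trained (e.g. Qwen-VL-Chat);
    `ref_*` come from a frozen copy of it serving as the reference policy.
    """
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Maximize the implicit reward gap between preferred and dispreferred responses.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```

In practice, each log-probability would be obtained by scoring the full image-instruction-response triple with both the trainable model and its frozen reference copy before computing the loss.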