Silkie: Preference Distillation for Large Visual Language Models

Abstract

This paper explores preference distillation for large vision-language models (LVLMs), improving their ability to generate helpful and faithful responses anchored in the visual context. We first build a vision-language feedback (VLFeedback) dataset using AI annotation. Specifically, responses are generated by models sampled from 12 LVLMs, conditioned on multi-modal instructions sourced from various datasets. We adopt GPT-4V to assess the generated outputs with respect to helpfulness, visual faithfulness, and ethical considerations. The preference supervision is then distilled into Qwen-VL-Chat through the direct preference optimization (DPO) method. The resulting model, Silkie, achieves 6.9% and 9.5% relative improvements on the MME benchmark in perception and cognition capabilities, respectively. Silkie also demonstrates reduced hallucination, setting a new state-of-the-art score of 3.02 on the MMHal-Bench benchmark. Further analysis shows that DPO with our VLFeedback dataset mainly boosts the fine-grained perception and complex cognition abilities of LVLMs, leading to more comprehensive improvements than human-annotated preference datasets.
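For readers unfamiliar with DPO, the objective it optimizes can be sketched for a single preference pair as below. This is a minimal illustration of the standard DPO loss, not the authors' training code: the function name `dpo_loss`, the value `beta=0.1`, and the log-probabilities in the usage note are all assumptions chosen for illustration. In this setting the "chosen" and "rejected" responses would be the ones ranked higher and lower by the annotator, and the reference model is typically a frozen copy of the model being tuned.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    the policy or the frozen reference model. beta (an assumed value here)
    controls how strongly the policy is kept close to the reference.
    """
    # Implicit rewards: how much more the policy favors a response
    # than the reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # -log(sigmoid(margin)), computed in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and reference assign identical log-probabilities, the margin is zero and the loss equals log 2; calling `dpo_loss(-4.0, -8.0, -5.0, -7.0)` with these made-up values gives a smaller loss, because the policy has shifted probability toward the chosen response relative to the reference.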
