
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models

October 23, 2024
Authors: Ziyu Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Haodong Duan, Conghui He, Yuanjun Xiong, Dahua Lin, Jiaqi Wang
cs.AI

Abstract

Visual preference alignment involves training Large Vision-Language Models (LVLMs) to predict human preferences between visual inputs. This is typically achieved with labeled datasets of chosen/rejected pairs and optimization algorithms such as direct preference optimization (DPO). Existing visual alignment methods, primarily designed for single-image scenarios, struggle to handle the complexity of multi-image tasks due to the scarcity of diverse training data and the high cost of annotating chosen/rejected pairs. We present Multi-Image Augmented Direct Preference Optimization (MIA-DPO), a visual preference alignment approach that effectively handles multi-image inputs. MIA-DPO mitigates the scarcity of diverse multi-image training data by extending single-image data with unrelated images arranged in grid-collage or pic-in-pic formats, significantly reducing the cost of multi-image data annotation. We observe that the attention values of LVLMs vary considerably across different images, and we use these attention values to identify and filter out rejected responses in which the model may have focused on the wrong image. This attention-aware selection constructs chosen/rejected pairs without relying on (i) human annotation, (ii) extra data, or (iii) external models or APIs. MIA-DPO is compatible with various architectures and outperforms existing methods on five multi-image benchmarks, achieving an average performance boost of 3.0% on LLaVA-v1.5 and 4.3% on the recent InternLM-XC2.5. Moreover, MIA-DPO has a minimal effect on the model's ability to understand single images.
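The abstract describes two mechanisms: turning single-image samples into multi-image samples by pasting unrelated images into a grid collage, and filtering candidate rejected responses by how much attention mass lands on the correct image. The sketch below is a minimal illustration of both ideas, not the authors' implementation; the helper names (`make_grid_collage`, `attends_to_target`) and the per-image attention list are hypothetical, and how attention is actually aggregated over image tokens is model-specific.

```python
# Minimal sketch of the two ideas summarized in the abstract (assumed, not the paper's code).
from PIL import Image


def make_grid_collage(target_img: Image.Image,
                      distractor_imgs: list[Image.Image],
                      cell_size: int = 336) -> tuple[Image.Image, int]:
    """Arrange the target image with up to three unrelated distractors in a 2x2 grid.

    Returns the collage and the cell index of the target image, so the original
    single-image question can be rephrased to refer to that cell.
    """
    images = [target_img] + distractor_imgs[:3]
    collage = Image.new("RGB", (2 * cell_size, 2 * cell_size))
    for i, img in enumerate(images):
        img = img.resize((cell_size, cell_size))
        row, col = divmod(i, 2)
        collage.paste(img, (col * cell_size, row * cell_size))
    return collage, 0  # target occupies cell 0 here; it could be shuffled in practice


def attends_to_target(per_image_attention: list[float],
                      target_index: int,
                      ratio_threshold: float = 1.0) -> bool:
    """Attention-aware check: does a response put more attention mass on the
    target image than on every distractor?

    `per_image_attention` is a hypothetical list of attention mass aggregated
    over each image's tokens for a generated response. Responses failing this
    check are candidates for the rejected side of a DPO pair.
    """
    target = per_image_attention[target_index]
    others = [a for i, a in enumerate(per_image_attention) if i != target_index]
    return all(target > ratio_threshold * a for a in others)
```

Under these assumptions, a chosen/rejected pair could be formed by keeping a response that passes `attends_to_target` as "chosen" and one that fails it as "rejected", without any human labeling.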

