MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
October 23, 2024
Authors: Ziyu Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Haodong Duan, Conghui He, Yuanjun Xiong, Dahua Lin, Jiaqi Wang
cs.AI
Abstract
Visual preference alignment involves training Large Vision-Language Models
(LVLMs) to predict human preferences between visual inputs. This is typically
achieved by using labeled datasets of chosen/rejected pairs and employing
optimization algorithms like direct preference optimization (DPO). Existing
visual alignment methods, primarily designed for single-image scenarios,
struggle to effectively handle the complexity of multi-image tasks due to the
scarcity of diverse training data and the high cost of annotating
chosen/rejected pairs. We present Multi-Image Augmented Direct Preference
Optimization (MIA-DPO), a visual preference alignment approach that effectively
handles multi-image inputs. MIA-DPO mitigates the scarcity of diverse
multi-image training data by extending single-image data with unrelated images
arranged in grid collages or pic-in-pic formats, significantly reducing the
costs associated with multi-image data annotations. Our observation reveals
that attention values of LVLMs vary considerably across different images. We
use attention values to identify and filter out rejected responses the model
may have mistakenly focused on. Our attention-aware selection constructs the
chosen/rejected pairs without relying on (i) human annotation, (ii) extra
data, or (iii) external models or APIs. MIA-DPO is compatible with various
architectures and outperforms existing methods on five multi-image benchmarks,
achieving an average performance boost of 3.0% on LLaVA-v1.5 and 4.3% on the
recent InternLM-XC2.5. Moreover, MIA-DPO has a minimal effect on the model's
ability to understand single images.
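
To make the data-construction idea concrete, the sketch below shows one way a single-image sample could be extended with unrelated images into a grid collage or a pic-in-pic composite, as the abstract describes. It is a minimal illustration under assumed names: grid_collage, pic_in_pic, augment_sample, cell_size, and scale are hypothetical and not taken from the MIA-DPO codebase.

```python
# Hedged sketch: extend a single-image sample with unrelated distractor
# images, either tiled into a grid collage or pasted pic-in-pic.
# All function names and parameters are illustrative assumptions.
import random
from PIL import Image


def grid_collage(images, cell_size=(336, 336)):
    """Tile images left-to-right into a 1xN grid collage."""
    w, h = cell_size
    canvas = Image.new("RGB", (w * len(images), h))
    for i, img in enumerate(images):
        canvas.paste(img.resize((w, h)), (i * w, 0))
    return canvas


def pic_in_pic(base, overlay, scale=0.3):
    """Paste a small unrelated image into the bottom-right corner of base."""
    bw, bh = base.size
    ow, oh = int(bw * scale), int(bh * scale)
    composite = base.copy()
    composite.paste(overlay.resize((ow, oh)), (bw - ow, bh - oh))
    return composite


def augment_sample(sample, distractor_paths, mode="grid"):
    """Extend a single-image (image, question, answer) sample with
    unrelated distractor images; the question and answer stay unchanged."""
    target = Image.open(sample["image"]).convert("RGB")
    distractors = [Image.open(p).convert("RGB")
                   for p in random.sample(distractor_paths, 2)]
    if mode == "grid":
        composite = grid_collage(distractors + [target])
    else:  # pic-in-pic layout
        composite = pic_in_pic(target, distractors[0])
    return {"image": composite,
            "question": sample["question"],
            "answer": sample["answer"]}
```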
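
The attention-aware selection can likewise be sketched: per-image attention mass is computed from the attention a generated answer pays to each image's tokens, and a response whose mass drifts away from the image the question targets is treated as the rejected side of a chosen/rejected pair. The tensor shape, threshold, and function names below are assumptions for illustration, not the paper's exact recipe.

```python
# Hedged sketch of attention-aware selection for building DPO pairs.
import torch


def per_image_attention(attn, image_token_spans):
    """attn: attention weights over input tokens (e.g. averaged over the
    answer tokens, layers, and heads), last dim indexed by input position.
    image_token_spans: list of (start, end) ranges, one per image."""
    masses = torch.stack([attn[..., s:e].sum() for s, e in image_token_spans])
    return masses / masses.sum()  # normalized attention mass per image


def is_rejected(attn, image_token_spans, target_idx, threshold=0.5):
    """Flag a response as 'rejected' when its attention mass on the image
    the question actually asks about falls below the threshold."""
    mass = per_image_attention(attn, image_token_spans)
    return mass[target_idx].item() < threshold


# Example: with three images in a collage and a question about image 0,
# a sampled answer whose attention mass on image 0 is below 0.5 would be
# kept as the rejected response and paired with a correctly focused one.
```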