RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
December 1, 2023
Authors: Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, Tat-Seng Chua
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have recently demonstrated
impressive capabilities in multimodal understanding, reasoning, and
interaction. However, existing MLLMs prevalently suffer from serious
hallucination problems, generating text that is not factually grounded in
associated images. The problem makes existing MLLMs untrustworthy and thus
impractical in real-world (especially high-stakes) applications. To address the
challenge, we present RLHF-V, which enhances MLLM trustworthiness via behavior
alignment from fine-grained correctional human feedback. Specifically, RLHF-V
collects human preference in the form of segment-level corrections on
hallucinations, and performs dense direct preference optimization over the
human feedback. Comprehensive experiments on five benchmarks in both automatic
and human evaluation show that RLHF-V can enable substantially more
trustworthy MLLM behaviors with promising data and computation efficiency.
Remarkably, using 1.4k annotated data samples, RLHF-V significantly reduces the
hallucination rate of the base MLLM by 34.8%, outperforming the concurrent
LLaVA-RLHF trained on 10k annotated data. The final model achieves
state-of-the-art performance in trustworthiness among open-source MLLMs, and
shows better robustness than GPT-4V in preventing hallucinations arising from
over-generalization. We open-source our code, model, and data at
https://github.com/RLHF-V/RLHF-V.
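The abstract describes the core training step as dense direct preference optimization over segment-level human corrections. The sketch below is a minimal, hypothetical illustration of how such a densely weighted DPO-style loss could look; the function name dense_dpo_loss, the correction_mask input, and the gamma up-weighting factor are assumptions made for illustration, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def dense_dpo_loss(policy_logps, ref_logps, correction_mask, beta=0.1, gamma=5.0):
    """Illustrative densely weighted DPO loss for one (chosen, rejected) pair.

    policy_logps / ref_logps: dicts mapping "chosen" (human-corrected response)
        and "rejected" (original hallucinated response) to per-token log-prob
        tensors of shape (T,).
    correction_mask: dict of boolean tensors of shape (T,) marking tokens that
        fall inside annotator-corrected segments.
    gamma: extra weight on corrected segments (value chosen for illustration).
    """
    def weighted_logp(logps, mask):
        # Unchanged tokens keep weight 1; corrected segments get weight gamma,
        # concentrating the preference signal on the edited spans.
        weights = 1.0 + (gamma - 1.0) * mask.float()
        return (weights * logps).sum()

    chosen_reward = beta * (
        weighted_logp(policy_logps["chosen"], correction_mask["chosen"])
        - weighted_logp(ref_logps["chosen"], correction_mask["chosen"])
    )
    rejected_reward = beta * (
        weighted_logp(policy_logps["rejected"], correction_mask["rejected"])
        - weighted_logp(ref_logps["rejected"], correction_mask["rejected"])
    )
    # Standard Bradley-Terry / DPO objective on the weighted implicit rewards.
    return -F.logsigmoid(chosen_reward - rejected_reward)
```

Up-weighting only the corrected segments is one way to read "dense" here: the preference comparison is driven by the spans annotators actually fixed rather than by the full responses.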