
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment

February 14, 2025
Authors: Yi-Fan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, Xue Wang, Yibo Hu, Bin Wen, Fan Yang, Zhang Zhang, Tingting Gao, Di Zhang, Liang Wang, Rong Jin, Tieniu Tan
cs.AI

Abstract

Despite notable advancements in Multimodal Large Language Models (MLLMs), most state-of-the-art models have not undergone thorough alignment with human preferences. This gap exists because current alignment research has primarily achieved progress in specific areas (e.g., hallucination reduction), while the broader question of whether aligning models with human preferences can systematically enhance MLLM capability remains largely unexplored. To this end, we introduce MM-RLHF, a dataset containing 120k fine-grained, human-annotated preference comparison pairs. This dataset represents a substantial advancement over existing resources, offering superior size, diversity, annotation granularity, and quality. Leveraging this dataset, we propose several key innovations to improve both the quality of reward models and the efficiency of alignment algorithms. Notably, we introduce a Critique-Based Reward Model, which generates critiques of model outputs before assigning scores, offering enhanced interpretability and more informative feedback compared to traditional scalar reward mechanisms. Additionally, we propose Dynamic Reward Scaling, a method that adjusts the loss weight of each sample according to the reward signal, thereby optimizing the use of high-quality comparison pairs. Our approach is rigorously evaluated across 10 distinct dimensions and 27 benchmarks, with results demonstrating significant and consistent improvements in model performance. Specifically, fine-tuning LLaVA-ov-7B with MM-RLHF and our alignment algorithm leads to a 19.5% increase in conversational abilities and a 60% improvement in safety. We have open-sourced the preference dataset, reward model, training and evaluation code, as well as reward modeling and safety benchmarks. For more details, please visit our project page: https://mm-rlhf.github.io.
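The abstract only sketches the two methodological components, so the snippet below is a minimal, hypothetical Python sketch of how a critique-then-score reward model and reward-margin-based loss weighting might be wired together. It is not the paper's implementation: the `mllm` object, its `generate`/`encode`/`score_head` interface, and the clamped-sigmoid weighting are illustrative assumptions.

```python
# Illustrative sketch only (not the authors' code) of the two ideas in the abstract:
# (1) a critique-based reward model that writes a critique before scoring, and
# (2) dynamic reward scaling that weights each comparison pair by its reward margin.

import torch
import torch.nn.functional as F


def critique_based_reward(mllm, image, question, answer):
    """Ask a reward MLLM to critique the answer first, then map the
    critique-conditioned representation to a scalar score.
    `mllm` and its methods are hypothetical placeholders."""
    critique = mllm.generate(
        image=image,
        prompt=f"Critique this answer.\nQuestion: {question}\nAnswer: {answer}",
    )
    hidden = mllm.encode(image=image, text=question + answer + critique)
    return mllm.score_head(hidden), critique  # scalar reward plus textual critique


def reward_scaled_pairwise_loss(policy_logratio, ref_logratio,
                                reward_chosen, reward_rejected,
                                beta=0.1, k=1.0, w_min=0.5, w_max=2.0):
    """DPO-style pairwise loss whose per-sample weight grows with the reward
    margin, so higher-confidence comparison pairs contribute more.
    The specific weighting function is an assumption for illustration."""
    margin = reward_chosen - reward_rejected                 # from the reward model
    weight = torch.clamp(w_min + k * torch.sigmoid(margin), max=w_max)
    logits = beta * (policy_logratio - ref_logratio)          # standard DPO logits
    # -log(sigmoid(x)) == softplus(-x); weights are detached so they only rescale the loss
    return (weight.detach() * F.softplus(-logits)).mean()
```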

