MM-RLHF: The Next Step Forward in Multimodal LLM Alignment

Abstract

Despite notable advances in Multimodal Large Language Models (MLLMs), most state-of-the-art models have not undergone thorough alignment with human preferences. This gap exists because current alignment research has made progress mainly in specific areas (e.g., hallucination reduction), while the broader question of whether aligning models with human preferences can systematically enhance MLLM capability remains largely unexplored. To this end, we introduce MM-RLHF, a dataset containing 120k fine-grained, human-annotated preference comparison pairs. This dataset substantially exceeds existing resources in size, diversity, annotation granularity, and quality. Leveraging this dataset, we propose several key innovations to improve both the quality of reward models and the efficiency of alignment algorithms. Notably, we introduce a Critique-Based Reward Model, which generates critiques of model outputs before assigning scores, offering better interpretability and more informative feedback than traditional scalar reward mechanisms. Additionally, we propose Dynamic Reward Scaling, a method that adjusts the loss weight of each sample according to the reward signal, thereby optimizing the use of high-quality comparison pairs. Our approach is rigorously evaluated across 10 distinct dimensions and 27 benchmarks, with results demonstrating significant and consistent improvements in model performance. Specifically, fine-tuning LLaVA-ov-7B with MM-RLHF and our alignment algorithm leads to a 19.5% increase in conversational abilities and a 60% improvement in safety. We have open-sourced the preference dataset, reward model, training and evaluation code, as well as reward modeling and safety benchmarks. For more details, please visit our project page: https://mm-rlhf.github.io.
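
As a rough illustration of adjusting a per-sample loss weight according to a reward signal, the sketch below scales the coefficient of a DPO-style pairwise loss by the reward margin between the chosen and rejected responses. The abstract does not give a formula, so everything here is an assumption: the saturating scaling function, the names `scaled_beta` and `dpo_style_loss`, and the parameters `beta_base`, `w`, and `k` are hypothetical and are not taken from the released code.

```python
import math


def scaled_beta(reward_margin, beta_base=0.1, w=0.5, k=0.5):
    """Hypothetical per-sample coefficient.

    Grows smoothly from beta_base (margin 0) toward beta_base * (1 + w)
    as the reward margin between chosen and rejected responses increases.
    Negative margins are clipped to 0, so they receive the base coefficient.
    """
    margin = max(reward_margin, 0.0)
    return beta_base * (1.0 + w * (1.0 - math.exp(-k * margin)))


def dpo_style_loss(logratio_chosen, logratio_rejected, reward_margin,
                   beta_base=0.1, w=0.5, k=0.5):
    """Pairwise loss -log(sigmoid(beta * (chosen - rejected))).

    logratio_* are log(pi_theta / pi_ref) for each response; beta is chosen
    per sample from the reward margin instead of being a fixed constant.
    """
    beta = scaled_beta(reward_margin, beta_base, w, k)
    z = beta * (logratio_chosen - logratio_rejected)
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


if __name__ == "__main__":
    # Same policy log-ratios, different reward margins.
    for margin in (0.0, 1.0, 5.0):
        print(margin, scaled_beta(margin), dpo_style_loss(-1.0, 1.0, margin))
```

In this sketch a pair that the reward model separates clearly (large margin) gets a larger coefficient, so a policy that still prefers the rejected response is penalized more on that pair than on an ambiguous one. The bounded form keeps any single sample from dominating the update; other weighting schemes would fit the description in the abstract equally well.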
