Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences
October 27, 2025
Authors: Zhuoran Jin, Hongbang Yuan, Kejian Zhu, Jiachun Li, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
cs.AI
Abstract
Reward models (RMs) play a critical role in aligning AI behaviors with human
preferences, yet they face two fundamental challenges: (1) Modality Imbalance,
where most RMs focus primarily on text and image modalities, offering
limited support for video, audio, and other modalities; and (2) Preference
Rigidity, where training on fixed binary preference pairs fails to capture the
complexity and diversity of personalized preferences. To address the above
challenges, we propose Omni-Reward, a step toward generalist omni-modal reward
modeling with support for free-form preferences, consisting of: (1) Evaluation:
We introduce Omni-RewardBench, the first omni-modal RM benchmark with free-form
preferences, covering nine tasks across five modalities including text, image,
video, audio, and 3D; (2) Data: We construct Omni-RewardData, a multimodal
preference dataset comprising 248K general preference pairs and 69K
instruction-tuning pairs for training generalist omni-modal RMs; (3) Model: We
propose Omni-RewardModel, which includes both discriminative and generative
RMs, and achieves strong performance on Omni-RewardBench as well as other
widely used reward modeling benchmarks.
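The abstract does not spell out the training objective for the discriminative RM, so the sketch below is only a rough illustration: it pairs a hypothetical free-form preference record (field names and file paths are invented, not taken from Omni-RewardData) with the standard pairwise Bradley-Terry loss commonly used to train discriminative reward models.

```python
import torch
import torch.nn.functional as F

# Hypothetical record from a free-form preference dataset; the schema is
# illustrative only and is not the actual Omni-RewardData format.
example_pair = {
    "modality": "audio",
    "prompt": "Generate a calm piano piece for studying.",
    "criterion": "Prefer a slower tempo and no percussion.",  # free-form preference
    "chosen": "audio/response_a.wav",
    "rejected": "audio/response_b.wav",
}

def bradley_terry_loss(reward_chosen: torch.Tensor,
                       reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry) objective commonly used for discriminative RMs:
    maximize the log-probability that the chosen response outscores the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with scalar rewards a discriminative RM might emit for the pair above.
loss = bradley_terry_loss(torch.tensor([1.3]), torch.tensor([0.4]))
print(loss.item())
```

Under this kind of objective, the free-form criterion would be supplied to the RM alongside the prompt and responses, so the same pair can receive different rewards under different stated preferences.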