Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences
October 27, 2025
Authors: Zhuoran Jin, Hongbang Yuan, Kejian Zhu, Jiachun Li, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
cs.AI
Abstract
Reward models (RMs) play a critical role in aligning AI behaviors with human
preferences, yet they face two fundamental challenges: (1) Modality Imbalance,
where most RMs focus primarily on text and image modalities, offering
limited support for video, audio, and other modalities; and (2) Preference
Rigidity, where training on fixed binary preference pairs fails to capture the
complexity and diversity of personalized preferences. To address the above
challenges, we propose Omni-Reward, a step toward generalist omni-modal reward
modeling with support for free-form preferences, consisting of: (1) Evaluation:
We introduce Omni-RewardBench, the first omni-modal RM benchmark with free-form
preferences, covering nine tasks across five modalities including text, image,
video, audio, and 3D; (2) Data: We construct Omni-RewardData, a multimodal
preference dataset comprising 248K general preference pairs and 69K
instruction-tuning pairs for training generalist omni-modal RMs; (3) Model: We
propose Omni-RewardModel, which includes both discriminative and generative
RMs, and achieves strong performance on Omni-RewardBench as well as other
widely used reward modeling benchmarks.
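The abstract does not spell out the training objective for the discriminative RM, so the sketch below is only a rough illustration: it pairs a hypothetical free-form preference record (field names and file paths are invented, not taken from Omni-RewardData) with the standard pairwise Bradley-Terry loss commonly used to train discriminative reward models.

```python
import torch
import torch.nn.functional as F

# Hypothetical record from a free-form preference dataset; the schema is
# illustrative only and is not the actual Omni-RewardData format.
example_pair = {
    "modality": "audio",
    "prompt": "Generate a calm piano piece for studying.",
    "criterion": "Prefer a slower tempo and no percussion.",  # free-form preference
    "chosen": "audio/response_a.wav",
    "rejected": "audio/response_b.wav",
}

def bradley_terry_loss(reward_chosen: torch.Tensor,
                       reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry) objective commonly used for discriminative RMs:
    maximize the log-probability that the chosen response outscores the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with scalar rewards a discriminative RM might emit for the pair above.
loss = bradley_terry_loss(torch.tensor([1.3]), torch.tensor([0.4]))
print(loss.item())
```

Under this kind of objective, the free-form criterion would be supplied to the RM alongside the prompt and responses, so the same pair can receive different rewards under different stated preferences.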