Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences
October 27, 2025
Authors: Zhuoran Jin, Hongbang Yuan, Kejian Zhu, Jiachun Li, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
cs.AI
Abstract
Reward models (RMs) play a critical role in aligning AI behaviors with human
preferences, yet they face two fundamental challenges: (1) Modality Imbalance,
where most RMs focus primarily on text and image modalities, offering
limited support for video, audio, and other modalities; and (2) Preference
Rigidity, where training on fixed binary preference pairs fails to capture the
complexity and diversity of personalized preferences. To address the above
challenges, we propose Omni-Reward, a step toward generalist omni-modal reward
modeling with support for free-form preferences, consisting of: (1) Evaluation:
We introduce Omni-RewardBench, the first omni-modal RM benchmark with free-form
preferences, covering nine tasks across five modalities including text, image,
video, audio, and 3D; (2) Data: We construct Omni-RewardData, a multimodal
preference dataset comprising 248K general preference pairs and 69K
instruction-tuning pairs for training generalist omni-modal RMs; (3) Model: We
propose Omni-RewardModel, which includes both discriminative and generative
RMs, and achieves strong performance on Omni-RewardBench as well as other
widely used reward modeling benchmarks.
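
As a rough illustration only (not the paper's actual implementation), the sketch below shows how a discriminative reward model of the kind mentioned above is commonly trained: a scalar reward head is fit with a Bradley-Terry pairwise loss over chosen/rejected responses, and a free-form preference criterion can simply be folded into the scored input. All names here (PrefRM, the toy embedding encoder, the sizes) are hypothetical stand-ins.

```python
# Minimal sketch, assuming a standard Bradley-Terry pairwise objective for a
# discriminative reward model; this is NOT the Omni-RewardModel implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrefRM(nn.Module):
    """Scores a tokenized (criterion + prompt + response) sequence with a scalar reward."""
    def __init__(self, vocab_size=32000, dim=256):
        super().__init__()
        # Stand-in for a real (multimodal) encoder backbone.
        self.embed = nn.EmbeddingBag(vocab_size, dim)
        self.head = nn.Linear(dim, 1)  # scalar reward head

    def forward(self, token_ids):  # token_ids: (batch, seq_len) of token indices
        return self.head(self.embed(token_ids)).squeeze(-1)  # (batch,)

def bradley_terry_loss(r_chosen, r_rejected):
    # -log sigmoid(r_chosen - r_rejected): pushes the chosen response's score
    # above the rejected one's for the same prompt and preference criterion.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy batch: in practice the token ids would come from a tokenizer applied to
# "free-form criterion + prompt + response"; random ids are used here.
rm = PrefRM()
chosen_ids = torch.randint(0, 32000, (4, 64))
rejected_ids = torch.randint(0, 32000, (4, 64))
loss = bradley_terry_loss(rm(chosen_ids), rm(rejected_ids))
loss.backward()
```

Conditioning the score on a free-form criterion (rather than baking a single fixed preference into the weights) is one straightforward way to address the "preference rigidity" issue the abstract describes; a generative RM would instead produce a judgment or critique as text.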