InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
January 21, 2025
Authors: Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Ziyu Liu, Shengyuan Ding, Shenxi Wu, Yubo Ma, Haodong Duan, Wenwei Zhang, Kai Chen, Dahua Lin, Jiaqi Wang
cs.AI
Abstract
Despite the promising performance of Large Vision Language Models (LVLMs) in
visual understanding, they occasionally generate incorrect outputs. While
reward models (RMs) with reinforcement learning or test-time scaling offer the
potential for improving generation quality, a critical gap remains: publicly
available multi-modal RMs for LVLMs are scarce, and the implementation details
of proprietary models are often unclear. We bridge this gap with
InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a simple yet effective
multi-modal reward model that aligns LVLMs with human preferences. To ensure
the robustness and versatility of IXC-2.5-Reward, we set up a high-quality
multi-modal preference corpus spanning text, image, and video inputs across
diverse domains, such as instruction following, general understanding,
text-rich documents, mathematical reasoning, and video understanding.
IXC-2.5-Reward achieves excellent results on the latest multi-modal reward
model benchmark and shows competitive performance on text-only reward model
benchmarks. We further demonstrate three key applications of IXC-2.5-Reward:
(1) Providing a supervisory signal for RL training. Integrating IXC-2.5-Reward
with Proximal Policy Optimization (PPO) yields IXC-2.5-Chat, which shows
consistent improvements in instruction following and multi-modal open-ended
dialogue; (2) Selecting the best response from candidate responses for
test-time scaling; and (3) Filtering outlier or noisy samples from existing
image and video instruction tuning training data. To ensure reproducibility and
facilitate further research, we have open-sourced all model weights and
training recipes at https://github.com/InternLM/InternLM-XComposer.
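As a concrete illustration of application (2), the minimal sketch below shows how a
scalar reward model can drive best-of-N response selection at test time: sample N
candidate responses from the LVLM, score each with the reward model, and keep the
highest-scoring one. The `score` callable is a hypothetical stand-in for however the
released IXC-2.5-Reward checkpoint is actually invoked; it is not the official API
from the repository.

```python
from typing import Callable, List, Tuple


def best_of_n(
    score: Callable[[str, object, str], float],  # hypothetical scorer: (prompt, image, response) -> reward
    prompt: str,
    image: object,                               # e.g. a PIL.Image or a list of video frames
    candidates: List[str],                       # N responses sampled from the LVLM
) -> Tuple[str, float]:
    """Return the candidate response with the highest scalar reward."""
    scored = [(resp, score(prompt, image, resp)) for resp in candidates]
    return max(scored, key=lambda pair: pair[1])
```

The same scoring loop extends naturally to application (3): under the same assumed
interface, training samples whose rewards fall far below the rest of the corpus can
be flagged as outliers or noise before image and video instruction tuning.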