Listener-Rewarded Thinking in VLMs for Image Preferences
June 28, 2025
Authors: Alexander Gambashidze, Li Pengyi, Matvey Skripkin, Andrey Galichin, Anton Gusarov, Konstantin Sobolev, Andrey Kuznetsov, Ivan Oseledets
cs.AI
Abstract
Training robust and generalizable reward models for human visual preferences
is essential for aligning text-to-image and text-to-video generative models
with human intent. However, current reward models often fail to generalize, and
supervised fine-tuning leads to memorization, demanding complex annotation
pipelines. While reinforcement learning (RL), specifically Group Relative
Policy Optimization (GRPO), improves generalization, we uncover a key failure
mode: a significant drop in reasoning accuracy occurs when a model's reasoning
trace contradicts that of an independent, frozen vision-language model
("listener") evaluating the same output. To address this, we introduce a
listener-augmented GRPO framework. Here, the listener re-evaluates the
reasoner's chain-of-thought to provide a dense, calibrated confidence score,
shaping the RL reward signal. This encourages the reasoner not only to answer
correctly but also to produce explanations that are persuasive to an independent
model. Our listener-shaped reward scheme achieves the best accuracy on the
ImageReward benchmark (67.4%), significantly improves out-of-distribution (OOD)
performance on a large-scale human preference dataset (1.2M votes, up to +6%
over the naive reasoner), and reduces reasoning contradictions compared to strong
GRPO and SFT baselines. These results demonstrate that listener-based rewards
provide a scalable, data-efficient path to aligning vision-language models with
nuanced human preferences. We will release our reasoning model here:
https://huggingface.co/alexgambashidze/qwen2.5vl_image_preference_reasoner.
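
To make the reward shaping concrete, here is a minimal sketch of how a frozen listener's confidence could be blended with answer correctness before computing standard GRPO group-relative advantages. The blend weight `alpha`, the function names, and the exact combination rule are illustrative assumptions rather than the paper's published formulation; only the group-wise normalization follows the usual GRPO recipe.

```python
import numpy as np

def listener_shaped_reward(is_correct: bool, listener_conf: float, alpha: float = 0.5) -> float:
    """Blend answer correctness with the frozen listener's confidence
    that the reasoner's chain-of-thought supports the preferred image.
    The 0.5 weighting is an assumed hyperparameter, not the paper's value."""
    correctness = 1.0 if is_correct else 0.0
    return (1.0 - alpha) * correctness + alpha * listener_conf

def grpo_advantages(rewards: list[float]) -> np.ndarray:
    """Group-relative advantage as in GRPO: normalize each sampled
    reasoning trace's reward against the group mean and std."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: four sampled reasoning traces for one preference query,
# each with (answered correctly?, listener confidence in that answer).
rewards = [listener_shaped_reward(c, p) for c, p in
           [(True, 0.9), (True, 0.6), (False, 0.4), (False, 0.1)]]
print(grpo_advantages(rewards))  # traces the listener finds convincing get larger advantages
```

In this toy group, a correct trace that also convinces the listener receives the highest advantage, which matches the qualitative behavior the abstract describes: the reasoner is pushed toward explanations an independent model finds persuasive, not just toward correct final answers.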