
Listener-Rewarded Thinking in VLMs for Image Preferences

June 28, 2025
Authors: Alexander Gambashidze, Li Pengyi, Matvey Skripkin, Andrey Galichin, Anton Gusarov, Konstantin Sobolev, Andrey Kuznetsov, Ivan Oseledets
cs.AI

Abstract

Training robust and generalizable reward models for human visual preferences is essential for aligning text-to-image and text-to-video generative models with human intent. However, current reward models often fail to generalize, and supervised fine-tuning leads to memorization, demanding complex annotation pipelines. While reinforcement learning (RL), specifically Group Relative Policy Optimization (GRPO), improves generalization, we uncover a key failure mode: a significant drop in reasoning accuracy occurs when a model's reasoning trace contradicts that of an independent, frozen vision-language model ("listener") evaluating the same output. To address this, we introduce a listener-augmented GRPO framework. Here, the listener re-evaluates the reasoner's chain-of-thought to provide a dense, calibrated confidence score, shaping the RL reward signal. This encourages the reasoner not only to answer correctly, but also to produce explanations that are persuasive to an independent model. Our listener-shaped reward scheme achieves the best accuracy on the ImageReward benchmark (67.4%), significantly improves out-of-distribution (OOD) performance on a large-scale human preference dataset (1.2M votes, up to +6% over the naive reasoner), and reduces reasoning contradictions compared to strong GRPO and SFT baselines. These results demonstrate that listener-based rewards provide a scalable, data-efficient path to aligning vision-language models with nuanced human preferences. We will release our reasoning model here: https://huggingface.co/alexgambashidze/qwen2.5vl_image_preference_reasoner.
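
The sketch below illustrates the general idea of the listener-shaped reward described in the abstract: a reasoner's binary correctness is blended with a frozen listener's confidence in the reasoner's chain-of-thought, and rewards are standardized within a sampled group, as in GRPO. This is a minimal illustration only; the class and function names, the mixing weight `alpha`, and the exact shaping rule are assumptions and are not taken from the paper or its released code.

```python
"""Minimal sketch (not the authors' implementation) of a listener-shaped reward
combined with group-relative (GRPO-style) advantages. The reasoner produces a
chain-of-thought and a preference answer; an independent frozen "listener" VLM
re-reads the chain-of-thought and emits a confidence that the answer is correct.
Here those listener confidences are assumed to be precomputed."""
from dataclasses import dataclass
from typing import List
import statistics


@dataclass
class Rollout:
    answer_correct: bool        # does the reasoner's final choice match the human vote?
    listener_confidence: float  # listener's calibrated P(answer is correct | chain-of-thought)


def listener_shaped_reward(r: Rollout, alpha: float = 0.5) -> float:
    """Blend binary correctness with the listener's dense confidence score.

    alpha = 1.0 recovers a plain correctness reward; smaller alpha lets the
    listener's judgment of the explanation shape the signal (alpha is an
    illustrative hyperparameter, not a value from the paper).
    """
    correctness = 1.0 if r.answer_correct else 0.0
    return alpha * correctness + (1.0 - alpha) * r.listener_confidence


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: standardize rewards within one prompt's group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled reasoning traces for the same image-preference prompt.
    group = [
        Rollout(answer_correct=True, listener_confidence=0.92),
        Rollout(answer_correct=True, listener_confidence=0.40),   # right answer, unconvincing CoT
        Rollout(answer_correct=False, listener_confidence=0.15),
        Rollout(answer_correct=False, listener_confidence=0.70),  # wrong answer, persuasive CoT
    ]
    rewards = [listener_shaped_reward(r) for r in group]
    print("shaped rewards:", [round(x, 3) for x in rewards])
    print("advantages:   ", [round(a, 3) for a in grpo_advantages(rewards)])
```

Under this kind of shaping, a correct answer backed by a chain-of-thought the listener finds convincing earns the highest reward, while a correct answer with an unpersuasive explanation is rewarded less, which is the incentive the abstract describes.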