Listener-Rewarded Thinking in VLMs for Image Preferences

Abstract

Training robust and generalizable reward models for human visual preferences is essential for aligning text-to-image and text-to-video generative models with human intent. However, current reward models often fail to generalize, and supervised fine-tuning leads to memorization and demands complex annotation pipelines. While reinforcement learning (RL), specifically Group Relative Policy Optimization (GRPO), improves generalization, we uncover a key failure mode: reasoning accuracy drops significantly when a model's reasoning trace contradicts that of an independent, frozen vision-language model (the "listener") evaluating the same output. To address this, we introduce a listener-augmented GRPO framework. Here, the listener re-evaluates the reasoner's chain-of-thought and provides a dense, calibrated confidence score that shapes the RL reward signal. This encourages the reasoner not only to answer correctly, but also to produce explanations that are persuasive to an independent model. Our listener-shaped reward scheme achieves the best accuracy on the ImageReward benchmark (67.4%), significantly improves out-of-distribution (OOD) performance on a large-scale human preference dataset (1.2M votes, up to +6% over a naive reasoner), and reduces reasoning contradictions compared to strong GRPO and SFT baselines. These results demonstrate that listener-based rewards provide a scalable, data-efficient path to aligning vision-language models with nuanced human preferences. We will release our reasoning model here: https://huggingface.co/alexgambashidze/qwen2.5vl_image_preference_reasoner.
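
As a rough illustration of how a listener confidence score could shape a GRPO reward, the following is a minimal sketch and not the authors' implementation. The abstract does not give the reward formula, so the additive combination, the `listener_weight` parameter, and the function names `shaped_reward` and `group_relative_advantages` are all assumptions made for this example. Here `listener_confidence` is assumed to be the frozen listener's confidence in the human-preferred image after reading the reasoner's chain-of-thought. The group-relative normalization (reward minus group mean, divided by group standard deviation) follows the usual GRPO formulation.

```python
from statistics import mean, pstdev


def shaped_reward(is_correct: bool, listener_confidence: float,
                  listener_weight: float = 0.5) -> float:
    """Hypothetical listener-shaped reward (an assumption, not the paper's formula).

    Adds a dense term, the listener's confidence in the human-preferred image,
    to the sparse correctness reward of the reasoner's final answer.
    """
    if not 0.0 <= listener_confidence <= 1.0:
        raise ValueError("listener_confidence must lie in [0, 1]")
    return float(is_correct) + listener_weight * listener_confidence


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: normalize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled reasoning traces: (answer correct?, listener confidence).
group = [(True, 0.9), (True, 0.4), (False, 0.2), (False, 0.7)]
rewards = [shaped_reward(ok, conf) for ok, conf in group]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, two traces that both answer correctly still receive different advantages: the one whose explanation convinces the listener is reinforced more strongly, which is the behavior the abstract describes.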
