Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension

Abstract

Despite significant advances in vision-language models (VLMs), effective approaches for enhancing response quality by scaling inference-time computation are still lacking. Recent studies on large language models identify this capability as a core step toward self-improving models. In this paper, we present the Vision Value Model (VisVM), which guides VLM inference-time search to generate responses with better visual comprehension. Specifically, VisVM not only evaluates the quality of the sentence generated in the current search step, but also anticipates the quality of subsequent sentences that may result from the current step, thus providing a long-term value. In this way, VisVM steers VLMs away from generating sentences prone to hallucinations or insufficient detail, thereby producing higher-quality responses. Experimental results demonstrate that VisVM-guided search significantly enhances VLMs' ability to generate descriptive captions with richer visual details and fewer hallucinations, compared with greedy decoding and with search methods that use other visual reward signals. Furthermore, we find that self-training the model with VisVM-guided captions improves the VLM's performance across a wide range of multimodal benchmarks, indicating the potential for developing self-improving VLMs. Our value model and code are available at https://github.com/si0wang/VisVM.
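
The step-by-step search described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `propose` and `value`, the toy candidate table, the toy scores, and the `<eos>` marker are all hypothetical stand-ins for a VLM sentence sampler and a trained value model. At each step the loop asks for several candidate sentences, scores each one, and keeps the highest-valued candidate.

```python
from typing import Callable, List, Sequence


def value_guided_search(
    propose: Callable[[Sequence[str]], List[str]],
    value: Callable[[Sequence[str], str], float],
    max_steps: int = 8,
    end_marker: str = "<eos>",
) -> List[str]:
    """Build a response sentence by sentence, keeping the best-valued candidate.

    `propose` and `value` are hypothetical callables: the first returns
    candidate next sentences for a partial response, the second returns a
    scalar score for appending one candidate to that partial response.
    """
    response: List[str] = []
    for _ in range(max_steps):
        candidates = propose(response)
        if not candidates:
            break
        # Pick the candidate with the highest estimated value.
        best = max(candidates, key=lambda s: value(response, s))
        if best == end_marker:
            break
        response.append(best)
    return response


# Toy stand-ins (assumptions for illustration only, no real model involved).
TOY_CANDIDATES = [
    ["A dog sits on grass.", "A cat flies a plane."],
    ["It wears a red collar.", "There are five moons."],
    ["<eos>"],
]
TOY_VALUES = {
    "A dog sits on grass.": 0.9,
    "A cat flies a plane.": 0.1,
    "It wears a red collar.": 0.8,
    "There are five moons.": 0.2,
    "<eos>": 0.0,
}


def toy_propose(prefix: Sequence[str]) -> List[str]:
    # Return the fixed candidate list for the current step, if any.
    step = len(prefix)
    return TOY_CANDIDATES[step] if step < len(TOY_CANDIDATES) else []


def toy_value(prefix: Sequence[str], sentence: str) -> float:
    # Look up a hand-written score; a real value model would use the image too.
    return TOY_VALUES[sentence]


caption = value_guided_search(toy_propose, toy_value)
print(" ".join(caption))
```

In this sketch the score is a plain lookup. In the setting the abstract describes, the value function would also reflect the expected quality of the sentences that follow, which is what distinguishes a long-term value from a per-sentence reward.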
