Improve Vision Language Model Chain-of-thought Reasoning

Abstract

Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness. However, current training recipes lack robust CoT reasoning data and rely on datasets dominated by short annotations with minimal rationales. In this work, we show that training VLMs on short answers does not generalize well to reasoning tasks that require more detailed responses. To address this, we propose a two-fold approach. First, we distill rationales from the GPT-4o model to enrich the training data and fine-tune VLMs, boosting their CoT performance. Second, we apply reinforcement learning to further calibrate reasoning quality. Specifically, we construct positive (correct) and negative (incorrect) pairs of model-generated reasoning chains by comparing their predictions with annotated short answers. Using this pairwise data, we apply the Direct Preference Optimization algorithm to refine the model's reasoning abilities. Our experiments demonstrate significant improvements in CoT reasoning on benchmark datasets, as well as better generalization to direct answer prediction. This work emphasizes the importance of incorporating detailed rationales in training and leveraging reinforcement learning to strengthen the reasoning capabilities of VLMs.
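The pair-construction step and the preference objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the exact-match answer comparison, and the value of `beta` are assumptions made for illustration. The loss is the standard Direct Preference Optimization objective for a single pair, computed from sequence log-probabilities under the policy and a frozen reference model.

```python
import math
from itertools import product


def normalize(answer: str) -> str:
    # Assumption: answers are compared by case-insensitive exact match.
    return answer.strip().lower()


def build_preference_pairs(chains, gold_answer):
    """Pair correct and incorrect reasoning chains for one question.

    chains: list of (reasoning_text, predicted_answer) sampled from the model.
    gold_answer: the annotated short answer for the question.
    """
    gold = normalize(gold_answer)
    correct = [text for text, pred in chains if normalize(pred) == gold]
    incorrect = [text for text, pred in chains if normalize(pred) != gold]
    # Every (correct, incorrect) combination becomes one preference pair.
    return [{"chosen": c, "rejected": r} for c, r in product(correct, incorrect)]


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(margin))."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    sampled = [
        ("3 apples plus 9 apples gives 12.", "12"),
        ("3 apples plus 9 apples gives 13.", "13"),
        ("Counting both baskets yields 12.", " 12 "),
    ]
    pairs = build_preference_pairs(sampled, "12")
    print(len(pairs))
    print(dpo_loss(-5.0, -9.0, -6.0, -8.0))
```

A question whose sampled chains are all correct or all incorrect yields no pairs in this sketch, so it contributes nothing to the preference data.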
