Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in Inference-time Scaling?

Abstract

Recent advances in large language models (LLMs) have demonstrated that inference-time computation techniques, such as decoding-time scaling and self-refinement, can significantly enhance reasoning capabilities without relying on external knowledge. A key driver of this success is the emergence of self-correction and self-verification behaviors, often elicited through reinforcement learning (RL). In this paper, we investigate whether these inference-time techniques extend effectively to vision-language models (VLMs), particularly those trained with RL. We find that while decoding strategies such as majority voting and best-of-N selection with self-verification both improve VLM reasoning performance, generation-reliant methods such as the former achieve significantly higher gains than verification-reliant methods such as the latter. Additionally, the self-correction behavior often associated with RL-tuned models, such as the "aha moment", does not lead to measurable gains. Through extensive experimentation within the inference-time scaling framework, we identify a key root cause: RL-trained VLMs still lack robust self-verification capabilities across both visual and textual modalities.
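To make the contrast between the two decoding strategies concrete, the following is a minimal sketch, not the paper's implementation. It assumes the N sampled answers are already available as strings; the function names, the `verifier_score` callable, and the scores in the usage note are hypothetical illustrations.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Generation-reliant selection: return the most frequent sampled answer."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # most_common orders equal counts by first occurrence, so ties
    # resolve to the answer that was sampled first.
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: Sequence[str],
              verifier_score: Callable[[str], float]) -> str:
    """Verification-reliant selection: return the answer scored highest.

    verifier_score is a hypothetical stand-in for the model judging
    its own candidate answers (self-verification).
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    return max(answers, key=verifier_score)


# Five hypothetical samples for one question.
samples = ["12", "15", "12", "12", "15"]

# Hypothetical self-verification scores that favor the minority answer.
unreliable_scores = {"12": 0.4, "15": 0.9}

voted = majority_vote(samples)
verified = best_of_n(samples, lambda a: unreliable_scores[a])
```

In this illustration the two strategies disagree: majority voting follows the sampled distribution, while best-of-N is only as good as the scores it is given. That dependence on verifier quality is the property the abstract points to when it reports smaller gains for verification-reliant methods.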
