Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in Inference-time Scaling?

June 20, 2025
作者: Mingyuan Wu, Meitang Li, Jingcheng Yang, Jize Jiang, Kaizhuo Yan, Zhaoheng Li, Minjia Zhang, Klara Nahrstedt
cs.AI

Abstract

Recent advances in large language models (LLMs) have demonstrated that inference-time computation techniques, such as decoding-time scaling and self-refinement, can significantly enhance reasoning capabilities without relying on external knowledge. A key driver of this success is the emergence of self-correction and self-verification behaviors, often elicited through reinforcement learning (RL). In this paper, we investigate whether these inference-time techniques extend effectively to vision-language models (VLMs), particularly those trained with RL. We find that while decoding strategies such as majority voting and best-of-N selection with self-verification both improve VLM reasoning performance, generation-reliant methods such as the former achieve significantly higher gains than verification-reliant methods such as the latter. Additionally, the self-correction behavior often associated with RL-tuned models, such as the "aha moment," does not lead to measurable gains. Through extensive experimentation within the inference-time scaling framework, we identify a key root cause: RL-trained VLMs still lack robust self-verification capabilities across both visual and textual modalities.
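
To make the contrast between the two strategy families concrete, here is a minimal sketch; the `generate_answer` and `self_verify_score` callables are hypothetical stand-ins (not from the paper) for sampling one reasoning trace from a VLM and asking the same model to score its own answer.

```python
# Sketch of the two inference-time scaling strategies the abstract compares.
# `generate_answer` and `self_verify_score` are hypothetical placeholders for
# VLM sampling and self-verification calls; they are not APIs from the paper.
from collections import Counter
from typing import Callable

def majority_vote(generate_answer: Callable[[], str], n: int) -> str:
    """Generation-reliant: sample N answers, return the most frequent one."""
    answers = [generate_answer() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def best_of_n(generate_answer: Callable[[], str],
              self_verify_score: Callable[[str], float],
              n: int) -> str:
    """Verification-reliant: sample N answers, keep the one the model itself
    scores highest. Its gains hinge on reliable self-verification, which the
    paper finds RL-trained VLMs still lack."""
    answers = [generate_answer() for _ in range(n)]
    return max(answers, key=self_verify_score)
```

Under this framing, the paper's result is that `majority_vote` benefits from extra samples regardless of verification quality, whereas `best_of_n` can only improve as much as the model's self-assigned scores are trustworthy.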