Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in Inference-time Scaling?

June 20, 2025
作者: Mingyuan Wu, Meitang Li, Jingcheng Yang, Jize Jiang, Kaizhuo Yan, Zhaoheng Li, Minjia Zhang, Klara Nahrstedt
cs.AI

Abstract

Recent advances in large language models (LLMs) have demonstrated that inference-time computation techniques, such as decoding-time scaling and self-refinement, can significantly enhance reasoning capabilities without relying on external knowledge. A key driver of this success is the emergence of self-correction and self-verification behaviors, often elicited through reinforcement learning (RL). In this paper, we investigate whether these inference-time techniques extend effectively to vision-language models (VLMs), particularly those trained with RL. We find that while decoding strategies such as majority voting and best-of-N selection with self-verification both improve VLM reasoning performance, generation-reliant methods such as the former achieve significantly higher gains than verification-reliant methods such as the latter. Additionally, the self-correction behavior often associated with RL-tuned models, such as the "aha moment," does not lead to measurable gains. Through extensive experimentation within the inference-time scaling framework, we identify a key root cause: RL-trained VLMs still lack robust self-verification capabilities across both visual and textual modalities.
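
To make the contrast between the two strategy families concrete, here is a minimal sketch; the `generate_answer` and `self_verify_score` callables are hypothetical stand-ins (not from the paper) for sampling one reasoning trace from a VLM and asking the same model to score its own answer.

```python
# Sketch of the two inference-time scaling strategies the abstract compares.
# `generate_answer` and `self_verify_score` are hypothetical placeholders for
# VLM sampling and self-verification calls; they are not APIs from the paper.
from collections import Counter
from typing import Callable

def majority_vote(generate_answer: Callable[[], str], n: int) -> str:
    """Generation-reliant: sample N answers, return the most frequent one."""
    answers = [generate_answer() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def best_of_n(generate_answer: Callable[[], str],
              self_verify_score: Callable[[str], float],
              n: int) -> str:
    """Verification-reliant: sample N answers, keep the one the model itself
    scores highest. Its gains hinge on reliable self-verification, which the
    paper finds RL-trained VLMs still lack."""
    answers = [generate_answer() for _ in range(n)]
    return max(answers, key=self_verify_score)
```

Under this framing, the paper's result is that `majority_vote` benefits from extra samples regardless of verification quality, whereas `best_of_n` can only improve as much as the model's self-assigned scores are trustworthy.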