Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in Inference-time Scaling?
June 20, 2025
Authors: Mingyuan Wu, Meitang Li, Jingcheng Yang, Jize Jiang, Kaizhuo Yan, Zhaoheng Li, Minjia Zhang, Klara Nahrstedt
cs.AI
Abstract
Recent advances in large language models (LLMs) have demonstrated that
inference-time computation techniques, such as decoding-time scaling and
self-refinement, can significantly enhance reasoning capabilities without
relying on external knowledge. A key driver of this success is the emergence of
self-correction and self-verification behaviors, often elicited through
reinforcement learning (RL). In this paper, we investigate whether these
inference-time techniques extend effectively to vision-language models (VLMs),
particularly those trained with RL. We find that while decoding strategies such
as majority voting and best-of-N selection with self-verification both improve
VLM reasoning performance, generation-reliant methods such as the former
achieve significantly larger gains than verification-reliant methods such as
the latter. Additionally, the self-correction behavior often associated with
RL-tuned models, such as the "aha moment," does not lead to measurable gains.
Through extensive experimentation within the inference-time scaling framework,
we identify a key root cause: RL-trained VLMs still lack robust
self-verification capabilities across both visual and textual modalities.
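
To make the two decoding strategies concrete, here is a minimal Python sketch
contrasting generation-reliant majority voting with verification-reliant
best-of-N selection. The `model.sample` and `model.score` calls are
hypothetical placeholders for a VLM's answer-generation and self-verification
steps; they are assumptions for illustration, not an API from the paper.

```python
from collections import Counter

def generate(model, image, question) -> str:
    """Sample one candidate answer from the VLM (hypothetical placeholder)."""
    return model.sample(image, question)

def self_verify_score(model, image, question, answer) -> float:
    """Ask the same model to score its own candidate (hypothetical placeholder)."""
    return model.score(image, question, answer)

def majority_vote(model, image, question, n: int = 8) -> str:
    """Generation-reliant: sample N answers and return the most frequent one."""
    answers = [generate(model, image, question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def best_of_n(model, image, question, n: int = 8) -> str:
    """Verification-reliant: sample N answers and keep the one the model
    itself scores highest (self-verification)."""
    answers = [generate(model, image, question) for _ in range(n)]
    return max(answers, key=lambda a: self_verify_score(model, image, question, a))
```

The paper's finding is that the first strategy yields larger gains than the
second, because the self-verification step in `best_of_n` is the weak link for
RL-trained VLMs.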