Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in Inference-time Scaling?

Abstract

Recent advances in large language models (LLMs) have demonstrated that inference-time computation techniques, such as decoding-time scaling and self-refinement, can significantly enhance reasoning capabilities without relying on external knowledge. A key driver of this success is the emergence of self-correction and self-verification behaviors, often elicited through reinforcement learning (RL). In this paper, we investigate whether these inference-time techniques extend effectively to vision-language models (VLMs), particularly those trained with RL. We find that decoding strategies such as majority voting and best-of-N selection with self-verification all improve VLM reasoning performance, but generation-reliant methods such as majority voting achieve significantly higher gains than verification-reliant methods such as best-of-N selection. Additionally, the self-correction behavior often associated with RL-tuned models, such as the "aha moment", does not lead to measurable gains. Through extensive experiments within the inference-time scaling framework, we identify a key root cause: RL-trained VLMs still lack robust self-verification capabilities across both visual and textual modalities.
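The two decoding strategies contrasted in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the candidate answers, and the verifier scores below are all hypothetical, chosen only to show how a generation-reliant selector (majority voting) and a verification-reliant selector (best-of-N) can disagree when the verifier is unreliable.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Generation-reliant selection: return the most frequent sampled answer."""
    if not answers:
        raise ValueError("need at least one candidate answer")
    # most_common(1) returns [(answer, count)]; ties go to the answer seen first.
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: Sequence[str], verifier_score: Callable[[str], float]) -> str:
    """Verification-reliant selection: return the answer the verifier scores highest."""
    if not answers:
        raise ValueError("need at least one candidate answer")
    return max(answers, key=verifier_score)


# Hypothetical final answers extracted from N = 5 sampled responses.
candidates = ["12", "15", "12", "12", "15"]

# Hypothetical self-verification scores; a stand-in for asking the model
# to judge its own responses. These numbers are illustrative only.
toy_scores = {"12": 0.4, "15": 0.9}

voted = majority_vote(candidates)
verified = best_of_n(candidates, lambda answer: toy_scores[answer])
print(voted, verified)
```

In this toy setup the two selectors pick different answers, so the quality of best-of-N depends entirely on the verifier's scores, while majority voting depends only on the distribution of sampled answers.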
