Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models

Abstract

Recent advances in text-only "slow-thinking" reasoning have prompted efforts to transfer this capability to vision-language models (VLMs) in order to train visual reasoning models (VRMs). However, this transfer faces a critical challenge: effective "slow thinking" in VRMs requires visual reflection, the ability to check the reasoning process against visual information. Through quantitative analysis, we observe that current VRMs exhibit limited visual reflection, because their attention to visual information diminishes rapidly as the generated response grows longer. To address this challenge, we propose a new VRM, Reflection-V, which enhances visual reflection through reasoning data construction for cold-start and reward design for reinforcement learning (RL). First, we construct vision-centered reasoning data by leveraging an agent that mediates between VLMs and reasoning LLMs, enabling cold-start learning of visual reflection patterns. Second, a visual-attention-based reward model is employed during RL to encourage reasoning grounded in visual information. As a result, Reflection-V demonstrates significant improvements across multiple visual reasoning benchmarks. Furthermore, Reflection-V maintains a stronger and more consistent reliance on visual information during visual reasoning, indicating that its visual reflection capability has been effectively enhanced.
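To make the quantitative observation above concrete, the sketch below computes, for each generated token, the share of attention mass that falls on image tokens. It is a minimal illustration under stated assumptions, not the metric or reward used for Reflection-V: the function name `visual_attention_ratio`, the array layout, and the toy numbers are all hypothetical, and the abstract does not say which layers or heads are measured.

```python
import numpy as np


def visual_attention_ratio(attn, visual_mask):
    """Fraction of attention mass on visual tokens at each generation step.

    Assumptions (hypothetical, not taken from the paper):
      attn        -- array of shape (steps, context_len); row t holds the
                     attention weights of the t-th generated token over the
                     context, already averaged over heads/layers. Positions
                     not yet generated are zero-padded.
      visual_mask -- boolean array of shape (context_len,), True where the
                     context position is an image token.
    """
    attn = np.asarray(attn, dtype=float)
    visual_mask = np.asarray(visual_mask, dtype=bool)

    total = attn.sum(axis=1)                   # attention mass per step
    visual = attn[:, visual_mask].sum(axis=1)  # mass on image tokens only

    # Guard against all-zero rows to avoid division by zero.
    return visual / np.clip(total, 1e-12, None)


# Toy data: 2 image tokens followed by 3 text positions, 3 generation steps.
toy_attn = [
    [0.6, 0.2, 0.2, 0.0, 0.0],
    [0.3, 0.1, 0.3, 0.3, 0.0],
    [0.1, 0.1, 0.2, 0.3, 0.3],
]
toy_mask = [True, True, False, False, False]

ratios = visual_attention_ratio(toy_attn, toy_mask)
print(ratios)
```

In this toy example the ratio falls at every step, which is the kind of decay the abstract attributes to current VRMs. A reward that favors visually grounded reasoning could, as one assumed design, score responses whose ratio stays high late in generation.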
