OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement

Abstract

Recent advancements demonstrated by DeepSeek-R1 have shown that complex reasoning abilities in large language models (LLMs), including sophisticated behaviors such as self-verification and self-correction, can be achieved through reinforcement learning (RL) with verifiable rewards, significantly improving model performance on challenging tasks such as AIME. Motivated by these findings, our study investigates whether similar reasoning capabilities can be successfully integrated into large vision-language models (LVLMs) and assesses their impact on challenging multimodal reasoning tasks. We consider an approach that iteratively leverages supervised fine-tuning (SFT) on lightweight training data and RL to further improve model generalization. Initially, reasoning capabilities were distilled from pure-text R1 models by generating reasoning steps using high-quality captions of images sourced from diverse visual datasets. Subsequently, iterative RL training further enhanced reasoning skills, with each iteration's RL-improved model generating refined SFT datasets for the next round. This iterative process yielded OpenVLThinker, an LVLM exhibiting consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision, demonstrating the potential of our strategy for robust vision-language reasoning. The code, model, and data are available at https://github.com/yihedeng9/OpenVLThinker.
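The loop described in the abstract (SFT, then RL with a verifiable reward, then regeneration of SFT data by the improved model) can be illustrated with a minimal Python sketch. This is not the authors' implementation. Every name here (`Example`, `verifiable_reward`, `collect_sft_data`, `iterative_self_improvement`, `sft_step`, `rl_step`) is hypothetical. The `<answer>...</answer>` output format, the exact-match reward, the choice to keep only correct responses as refined SFT data, and the choice to continue training from the current model in each round are all assumptions. Image inputs are omitted for brevity, so a "model" is just a function from a text prompt to a text response.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical types: a model maps a prompt to a response string.
Model = Callable[[str], str]
SFTData = List[Tuple[str, str]]  # (prompt, response) pairs


@dataclass
class Example:
    question: str
    answer: str  # ground-truth final answer used by the reward


def extract_final_answer(response: str) -> Optional[str]:
    """Return the text of the last <answer>...</answer> tag, or None.

    The tag format is an assumption made for this sketch.
    """
    matches = re.findall(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


def verifiable_reward(response: str, ground_truth: str) -> float:
    """Return 1.0 on a case-insensitive exact match with the ground truth, else 0.0."""
    predicted = extract_final_answer(response)
    if predicted is None:
        return 0.0
    return 1.0 if predicted.lower() == ground_truth.strip().lower() else 0.0


def collect_sft_data(model: Model, examples: List[Example]) -> SFTData:
    """Keep only responses that earn full reward (an assumed filtering rule)."""
    kept: SFTData = []
    for ex in examples:
        response = model(ex.question)
        if verifiable_reward(response, ex.answer) == 1.0:
            kept.append((ex.question, response))
    return kept


def iterative_self_improvement(
    model: Model,
    seed_sft_data: SFTData,
    examples: List[Example],
    sft_step: Callable[[Model, SFTData], Model],
    rl_step: Callable[[Model, List[Example]], Model],
    num_iterations: int,
) -> Model:
    """Alternate SFT and RL, regenerating SFT data from the RL-improved model."""
    sft_data = seed_sft_data  # e.g. reasoning distilled from a text-only model
    for _ in range(num_iterations):
        # Assumption: each round continues from the current model.
        model = sft_step(model, sft_data)
        model = rl_step(model, examples)
        sft_data = collect_sft_data(model, examples)
    return model
```

In practice `sft_step` and `rl_step` would wrap real training routines. Here they are left as caller-supplied functions so that the control flow can be read, and tested, independently of any training framework.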
