Skywork-R1V3 Technical Report

Abstract

We introduce Skywork-R1V3, an advanced open-source vision-language model (VLM) that pioneers a new approach to visual reasoning. Its key innovation lies in effectively transferring reasoning skills from text-only large language models (LLMs) to visual tasks. The strong performance of Skywork-R1V3 stems primarily from our elaborate post-training RL framework, which activates and enhances the model's reasoning ability without any additional continued pre-training. Through this framework, we further uncover the fundamental role of the connector module in achieving robust cross-modal alignment for multimodal reasoning models. In addition, we introduce a unique indicator of reasoning capability, the entropy of critical reasoning tokens, which has proven highly effective for checkpoint selection during RL training. Skywork-R1V3 achieves state-of-the-art results on MMMU, improving from 64.3% to 76.0%, a gain of 11.7 percentage points. This performance matches entry-level human capabilities. Remarkably, our RL-powered post-training approach enables even the 38B-parameter model to rival top closed-source VLMs. The implementation successfully transfers mathematical reasoning to reasoning tasks in other subjects. We also include an analysis of curriculum learning and reinforcement fine-tuning strategies, along with a broader discussion of multimodal reasoning. Skywork-R1V3 represents a significant leap in multimodal reasoning, showcasing RL as a powerful engine for advancing open-source VLM capabilities.
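To make the "entropy of critical reasoning tokens" indicator more concrete, the sketch below computes the standard Shannon entropy of next-token distributions and averages it over a chosen set of decoding steps. It is only an illustration under stated assumptions: the abstract does not say how critical tokens are identified or how the resulting value is turned into a checkpoint-selection rule, so `critical_positions` is a hypothetical input supplied by the caller, the function names are invented here, and the sketch stops at producing a score.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    # Zero-probability entries contribute nothing, so they are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_critical_entropy(step_probs, critical_positions):
    """Average entropy over the decoding steps flagged as critical.

    step_probs: list of probability distributions, one per generated token.
    critical_positions: indices into step_probs (hypothetical; how these
        are chosen is not specified in the abstract).
    """
    if not critical_positions:
        raise ValueError("critical_positions must not be empty")
    total = sum(token_entropy(step_probs[i]) for i in critical_positions)
    return total / len(critical_positions)


# Toy distributions over a tiny vocabulary, one per decoding step.
step_probs = [
    [1.0, 0.0],                # fully confident step
    [0.5, 0.5],                # two equally likely tokens
    [0.25, 0.25, 0.25, 0.25],  # four equally likely tokens
]
score = mean_critical_entropy(step_probs, critical_positions=[1, 2])
print(f"mean entropy at critical positions: {score:.4f} nats")
```

In practice such a score would be computed per checkpoint on a fixed set of prompts, which gives one number per checkpoint that can be tracked across RL training.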
