Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought

Abstract

We introduce Skywork R1V, a multimodal reasoning model that extends R1-series large language models (LLMs) to visual modalities via an efficient multimodal transfer method. Using a lightweight visual projector, Skywork R1V adapts to multimodal input without retraining either the underlying language model or the vision encoder. To strengthen visual-text alignment, we propose a hybrid optimization strategy that combines iterative Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), significantly improving the efficiency of cross-modal integration. We also introduce an adaptive-length Chain-of-Thought distillation approach for generating reasoning data. This approach dynamically optimizes the length of reasoning chains, which improves inference efficiency and prevents overthinking caused by excessively long reasoning. In empirical evaluations, Skywork R1V, with only 38B parameters, delivers competitive performance, scoring 69.0 on the MMMU benchmark and 67.5 on MathVista. It also retains robust textual reasoning performance, scoring 72.0 on AIME and 94.0 on MATH500. The Skywork R1V model weights have been publicly released to promote openness and reproducibility.
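The abstract names Group Relative Policy Optimization (GRPO) without describing its mechanics. As a minimal sketch, assuming the commonly described GRPO formulation in which each sampled response's reward is normalized against the other responses drawn for the same prompt, the group-relative advantage could be computed as below. The function name `group_relative_advantages`, the `eps` parameter, the use of the population standard deviation, and the example rewards are all illustrative assumptions, not details taken from Skywork R1V.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Hypothetical helper: each advantage is (reward - group mean) divided by
    (group std + eps). Using the population std is an assumption; real
    implementations may differ.
    """
    if not rewards:
        return []
    mean = fmean(rewards)
    std = pstdev(rewards)  # population standard deviation over the group
    # eps keeps the division defined when every reward in the group is equal
    return [(r - mean) / (std + eps) for r in rewards]


# Illustrative rewards for four responses sampled for a single prompt
example_rewards = [1.0, 0.0, 0.0, 1.0]
example_advantages = group_relative_advantages(example_rewards)
```

Responses that score above their group's mean receive a positive advantage and those below receive a negative one, so no separately learned value model is needed to provide a baseline.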
