SAIL-VL2 Technical Report

Abstract

We introduce SAIL-VL2, an open-suite vision-language foundation model (LVM) for comprehensive multimodal understanding and reasoning. As the successor to SAIL-VL, SAIL-VL2 achieves state-of-the-art performance at the 2B and 8B parameter scales across diverse image and video benchmarks, demonstrating strong capabilities from fine-grained perception to complex reasoning. Three core innovations drive its effectiveness. First, a large-scale data curation pipeline with scoring and filtering strategies enhances both quality and distribution across captioning, OCR, QA, and video data, improving training efficiency. Second, a progressive training framework begins with a powerful pre-trained vision encoder (SAIL-ViT), advances through multimodal pre-training, and culminates in a thinking-fusion SFT-RL hybrid paradigm that systematically strengthens model capabilities. Third, architectural advances extend beyond dense LLMs to efficient sparse Mixture-of-Experts (MoE) designs. With these contributions, SAIL-VL2 demonstrates competitive performance across 106 datasets and achieves state-of-the-art results on challenging reasoning benchmarks such as MMMU and MathVista. Furthermore, on the OpenCompass leaderboard, SAIL-VL2-2B ranks first among officially released open-source models under the 4B parameter scale, while serving as an efficient and extensible foundation for the open-source multimodal community.
