SAIL-VL2 Technical Report
September 17, 2025
Authors: Weijie Yin, Yongjie Ye, Fangxun Shu, Yue Liao, Zijian Kang, Hongyuan Dong, Haiyang Yu, Dingkang Yang, Jiacong Wang, Han Wang, Wenzhuo Liu, Xiao Liang, Shuicheng Yan, Chao Feng
cs.AI
Abstract
We introduce SAIL-VL2, an open-suite vision-language foundation model (LVM) for comprehensive multimodal understanding and reasoning. As the successor to SAIL-VL, SAIL-VL2 achieves state-of-the-art performance at the 2B and 8B parameter scales across diverse image and video benchmarks, demonstrating strong capabilities from fine-grained perception to complex reasoning. Three core innovations drive its effectiveness. First, a large-scale data curation pipeline with scoring and filtering strategies enhances both quality and distribution across captioning, OCR, QA, and video data, improving training efficiency. Second, a progressive training framework begins with a powerful pre-trained vision encoder (SAIL-ViT), advances through multimodal pre-training, and culminates in a thinking-fusion SFT-RL hybrid paradigm that systematically strengthens model capabilities. Third, architectural advances extend beyond dense LLMs to efficient sparse Mixture-of-Experts (MoE) designs. With these contributions, SAIL-VL2 demonstrates competitive performance across 106 datasets and achieves state-of-the-art results on challenging reasoning benchmarks such as MMMU and MathVista. Furthermore, on the OpenCompass leaderboard, SAIL-VL2-2B ranks first among officially released open-source models under the 4B parameter scale, while serving as an efficient and extensible foundation for the open-source multimodal community.
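The abstract describes a data curation pipeline that scores samples and filters or rebalances them across captioning, OCR, QA, and video sources. Below is a minimal, hypothetical sketch of such a score-then-filter step; the `Sample` fields, `quality_score` callable, threshold, and per-source cap are illustrative assumptions and are not taken from the report's actual pipeline.

```python
# Minimal sketch of a score-then-filter data curation step (illustrative only).
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Sample:
    image_path: str
    text: str
    source: str  # e.g. "caption", "ocr", "qa", "video" (hypothetical labels)


def curate(samples: Iterable[Sample],
           quality_score: Callable[[Sample], float],
           threshold: float = 0.5,
           per_source_cap: int = 1_000_000) -> List[Sample]:
    """Keep samples above a quality threshold and cap each source to balance the mix."""
    kept: List[Sample] = []
    counts: Dict[str, int] = {}
    for s in samples:
        if quality_score(s) < threshold:
            continue  # drop low-quality samples
        if counts.get(s.source, 0) >= per_source_cap:
            continue  # cap over-represented sources to keep the distribution balanced
        counts[s.source] = counts.get(s.source, 0) + 1
        kept.append(s)
    return kept
```

In practice, `quality_score` would be backed by a learned scorer (e.g. a caption-quality or OCR-fidelity model); the sketch only shows how scoring and capping combine to improve both quality and distribution, as the abstract claims.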
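The abstract also mentions sparse Mixture-of-Experts (MoE) designs as an architectural direction beyond dense LLMs. The following is a generic, minimal top-k routed MoE feed-forward layer in PyTorch to illustrate the idea; the expert count, hidden sizes, and top-k value are assumptions for illustration and do not reflect SAIL-VL2's actual configuration.

```python
# Generic top-k routed MoE feed-forward layer (illustrative sketch, not SAIL-VL2's design).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); flatten tokens so each is routed independently
        tokens = x.reshape(-1, x.size(-1))
        logits = self.router(tokens)                    # (n_tokens, n_experts)
        weights, indices = logits.topk(self.k, dim=-1)  # keep only top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(tokens)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e            # tokens assigned to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)
```

Only the selected experts process each token, which is what makes such layers computationally sparse relative to a dense feed-forward block of equal total parameter count.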