SAIL-VL2 Technical Report

September 17, 2025
Authors: Weijie Yin, Yongjie Ye, Fangxun Shu, Yue Liao, Zijian Kang, Hongyuan Dong, Haiyang Yu, Dingkang Yang, Jiacong Wang, Han Wang, Wenzhuo Liu, Xiao Liang, Shuicheng Yan, Chao Feng
cs.AI

Abstract

We introduce SAIL-VL2, an open-suite vision-language foundation model (LVM) for comprehensive multimodal understanding and reasoning. As the successor to SAIL-VL, SAIL-VL2 achieves state-of-the-art performance at the 2B and 8B parameter scales across diverse image and video benchmarks, demonstrating strong capabilities from fine-grained perception to complex reasoning. Three core innovations drive its effectiveness. First, a large-scale data curation pipeline with scoring and filtering strategies enhances both quality and distribution across captioning, OCR, QA, and video data, improving training efficiency. Second, a progressive training framework begins with a powerful pre-trained vision encoder (SAIL-ViT), advances through multimodal pre-training, and culminates in a thinking-fusion SFT-RL hybrid paradigm that systematically strengthens model capabilities. Third, architectural advances extend beyond dense LLMs to efficient sparse Mixture-of-Experts (MoE) designs. With these contributions, SAIL-VL2 demonstrates competitive performance across 106 datasets and achieves state-of-the-art results on challenging reasoning benchmarks such as MMMU and MathVista. Furthermore, on the OpenCompass leaderboard, SAIL-VL2-2B ranks first among officially released open-source models under the 4B parameter scale, while serving as an efficient and extensible foundation for the open-source multimodal community.
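The abstract notes that SAIL-VL2's architecture extends beyond dense LLMs to sparse Mixture-of-Experts (MoE) designs. The snippet below is a minimal, generic sketch of a sparse MoE feed-forward block with top-k routing, included only to illustrate what "sparse expert activation" means in this context; it is not the SAIL-VL2 implementation, and all module names, dimensions, and the choice of top-2 routing are illustrative assumptions.

```python
# Illustrative sketch of a sparse Mixture-of-Experts (MoE) feed-forward block.
# NOT the SAIL-VL2 implementation: names, sizes, and top-k routing are assumed
# here only to show how a sparse MoE layer activates a few experts per token.
import torch
import torch.nn as nn


class SparseMoE(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router produces one score per expert for each token.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Each expert is an independent feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> flatten tokens for routing.
        tokens = x.reshape(-1, x.size(-1))
        logits = self.router(tokens)                        # (n_tokens, num_experts)
        weights, indices = logits.softmax(-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)   # renormalize top-k gates

        out = torch.zeros_like(tokens)
        # Only the selected (token, expert) pairs are computed: sparse activation.
        for expert_id, expert in enumerate(self.experts):
            token_idx, slot = (indices == expert_id).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            gate = weights[token_idx, slot].unsqueeze(-1)
            out[token_idx] += gate * expert(tokens[token_idx])
        return out.reshape_as(x)


if __name__ == "__main__":
    layer = SparseMoE()
    y = layer(torch.randn(2, 16, 512))
    print(y.shape)  # torch.Size([2, 16, 512])
```

In such designs only `top_k` of the `num_experts` feed-forward networks run for each token, so total parameter count grows while per-token compute stays close to that of a dense layer of the same width.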