SAIL-VL2 Technical Report
September 17, 2025
Authors: Weijie Yin, Yongjie Ye, Fangxun Shu, Yue Liao, Zijian Kang, Hongyuan Dong, Haiyang Yu, Dingkang Yang, Jiacong Wang, Han Wang, Wenzhuo Liu, Xiao Liang, Shuicheng Yan, Chao Feng
cs.AI
Abstract
We introduce SAIL-VL2, an open-suite vision-language foundation model (LVM) for comprehensive multimodal understanding and reasoning. As the successor to SAIL-VL, SAIL-VL2 achieves state-of-the-art performance at the 2B and 8B parameter scales across diverse image and video benchmarks, demonstrating strong capabilities from fine-grained perception to complex reasoning. Three core innovations drive its effectiveness. First, a large-scale data curation pipeline with scoring and filtering strategies enhances both quality and distribution across captioning, OCR, QA, and video data, improving training efficiency. Second, a progressive training framework begins with a powerful pre-trained vision encoder (SAIL-ViT), advances through multimodal pre-training, and culminates in a thinking-fusion SFT-RL hybrid paradigm that systematically strengthens model capabilities. Third, architectural advances extend beyond dense LLMs to efficient sparse Mixture-of-Experts (MoE) designs. With these contributions, SAIL-VL2 demonstrates competitive performance across 106 datasets and achieves state-of-the-art results on challenging reasoning benchmarks such as MMMU and MathVista. Furthermore, on the OpenCompass leaderboard, SAIL-VL2-2B ranks first among officially released open-source models under the 4B parameter scale, while serving as an efficient and extensible foundation for the open-source multimodal community.
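The abstract describes a data curation pipeline that scores samples and filters or rebalances them across captioning, OCR, QA, and video sources. Below is a minimal, hypothetical sketch of such a score-then-filter step; the `Sample` fields, `quality_score` callable, threshold, and per-source cap are illustrative assumptions and are not taken from the report's actual pipeline.

```python
# Minimal sketch of a score-then-filter data curation step (illustrative only).
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Sample:
    image_path: str
    text: str
    source: str  # e.g. "caption", "ocr", "qa", "video" (hypothetical labels)


def curate(samples: Iterable[Sample],
           quality_score: Callable[[Sample], float],
           threshold: float = 0.5,
           per_source_cap: int = 1_000_000) -> List[Sample]:
    """Keep samples above a quality threshold and cap each source to balance the mix."""
    kept: List[Sample] = []
    counts: Dict[str, int] = {}
    for s in samples:
        if quality_score(s) < threshold:
            continue  # drop low-quality samples
        if counts.get(s.source, 0) >= per_source_cap:
            continue  # cap over-represented sources to keep the distribution balanced
        counts[s.source] = counts.get(s.source, 0) + 1
        kept.append(s)
    return kept
```

In practice, `quality_score` would be backed by a learned scorer (e.g. a caption-quality or OCR-fidelity model); the sketch only shows how scoring and capping combine to improve both quality and distribution, as the abstract claims.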
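The abstract also mentions sparse Mixture-of-Experts (MoE) designs as an architectural direction beyond dense LLMs. The following is a generic, minimal top-k routed MoE feed-forward layer in PyTorch to illustrate the idea; the expert count, hidden sizes, and top-k value are assumptions for illustration and do not reflect SAIL-VL2's actual configuration.

```python
# Generic top-k routed MoE feed-forward layer (illustrative sketch, not SAIL-VL2's design).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); flatten tokens so each is routed independently
        tokens = x.reshape(-1, x.size(-1))
        logits = self.router(tokens)                    # (n_tokens, n_experts)
        weights, indices = logits.topk(self.k, dim=-1)  # keep only top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(tokens)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e            # tokens assigned to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)
```

Only the selected experts process each token, which is what makes such layers computationally sparse relative to a dense feed-forward block of equal total parameter count.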