SAIL-VL2 Technical Report

September 17, 2025
Authors: Weijie Yin, Yongjie Ye, Fangxun Shu, Yue Liao, Zijian Kang, Hongyuan Dong, Haiyang Yu, Dingkang Yang, Jiacong Wang, Han Wang, Wenzhuo Liu, Xiao Liang, Shuicheng Yan, Chao Feng
cs.AI

Abstract

We introduce SAIL-VL2, an open-suite vision-language foundation model (LVM) for comprehensive multimodal understanding and reasoning. As the successor to SAIL-VL, SAIL-VL2 achieves state-of-the-art performance at the 2B and 8B parameter scales across diverse image and video benchmarks, demonstrating strong capabilities from fine-grained perception to complex reasoning. Three core innovations drive its effectiveness. First, a large-scale data curation pipeline with scoring and filtering strategies enhances both quality and distribution across captioning, OCR, QA, and video data, improving training efficiency. Second, a progressive training framework begins with a powerful pre-trained vision encoder (SAIL-ViT), advances through multimodal pre-training, and culminates in a thinking-fusion SFT-RL hybrid paradigm that systematically strengthens model capabilities. Third, architectural advances extend beyond dense LLMs to efficient sparse Mixture-of-Experts (MoE) designs. With these contributions, SAIL-VL2 demonstrates competitive performance across 106 datasets and achieves state-of-the-art results on challenging reasoning benchmarks such as MMMU and MathVista. Furthermore, on the OpenCompass leaderboard, SAIL-VL2-2B ranks first among officially released open-source models under the 4B parameter scale, while serving as an efficient and extensible foundation for the open-source multimodal community.
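The abstract notes that SAIL-VL2's architecture extends beyond dense LLMs to sparse Mixture-of-Experts (MoE) designs. The snippet below is a minimal, generic sketch of a sparse MoE feed-forward block with top-k routing, included only to illustrate what "sparse expert activation" means in this context; it is not the SAIL-VL2 implementation, and all module names, dimensions, and the choice of top-2 routing are illustrative assumptions.

```python
# Illustrative sketch of a sparse Mixture-of-Experts (MoE) feed-forward block.
# NOT the SAIL-VL2 implementation: names, sizes, and top-k routing are assumed
# here only to show how a sparse MoE layer activates a few experts per token.
import torch
import torch.nn as nn


class SparseMoE(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router produces one score per expert for each token.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Each expert is an independent feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> flatten tokens for routing.
        tokens = x.reshape(-1, x.size(-1))
        logits = self.router(tokens)                        # (n_tokens, num_experts)
        weights, indices = logits.softmax(-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)   # renormalize top-k gates

        out = torch.zeros_like(tokens)
        # Only the selected (token, expert) pairs are computed: sparse activation.
        for expert_id, expert in enumerate(self.experts):
            token_idx, slot = (indices == expert_id).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            gate = weights[token_idx, slot].unsqueeze(-1)
            out[token_idx] += gate * expert(tokens[token_idx])
        return out.reshape_as(x)


if __name__ == "__main__":
    layer = SparseMoE()
    y = layer(torch.randn(2, 16, 512))
    print(y.shape)  # torch.Size([2, 16, 512])
```

In such designs only `top_k` of the `num_experts` feed-forward networks run for each token, so total parameter count grows while per-token compute stays close to that of a dense layer of the same width.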