SAIL-VL2 Technical Report

Abstract

We introduce SAIL-VL2, an open-suite vision-language foundation model (LVM) for comprehensive multimodal understanding and reasoning. As the successor to SAIL-VL, SAIL-VL2 achieves state-of-the-art performance at the 2B and 8B parameter scales across diverse image and video benchmarks, demonstrating strong capabilities ranging from fine-grained perception to complex reasoning. Three core innovations drive its effectiveness. First, a large-scale data curation pipeline with scoring and filtering strategies improves both the quality and the distribution of captioning, OCR, QA, and video data, which in turn improves training efficiency. Second, a progressive training framework begins with a powerful pre-trained vision encoder (SAIL-ViT), advances through multimodal pre-training, and culminates in a thinking-fusion SFT-RL hybrid paradigm that systematically strengthens model capabilities. Third, architectural advances extend beyond dense LLMs to efficient sparse Mixture-of-Experts (MoE) designs. With these contributions, SAIL-VL2 demonstrates competitive performance across 106 datasets and achieves state-of-the-art results on challenging reasoning benchmarks such as MMMU and MathVista. Furthermore, on the OpenCompass leaderboard, SAIL-VL2-2B ranks first among officially released open-source models under the 4B parameter scale, and it serves as an efficient and extensible foundation for the open-source multimodal community.
