Seed1.5-VL Technical Report

Abstract

We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL consists of a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) large language model (LLM) with 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide range of public VLM benchmarks and internal evaluation suites, achieving state-of-the-art results on 38 of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will enable broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experience building Seed1.5-VL, covering model design, data construction, and training at various stages, in the hope that it can inspire further research. Seed1.5-VL is now accessible at https://www.volcengine.com/ (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428).