
Innovator-VL: A Multimodal Large Language Model for Scientific Discovery

January 27, 2026
Authors: Zichen Wen, Boxue Yang, Shuang Chen, Yaojie Zhang, Yuhang Han, Junlong Ke, Cong Wang, Yicheng Fu, Jiawang Zhao, Jiangchao Yao, Xi Fang, Zhen Wang, Henxing Cai, Lin Yao, Zhifeng Gao, Yanhui Hong, Nang Yuan, Yixuan Li, Guojiang Zhao, Haoyi Tao, Nan Wang, Han Lyu, Guolin Ke, Ning Liao, Xiaoxing Wang, Kai Chen, Zhiyu Li, Feiyu Xiong, Sihan Hu, Kun Chen, Yanfeng Wang, Weinan E, Linfeng Zhang, Linfeng Zhang
cs.AI

Abstract

We present Innovator-VL, a scientific multimodal large language model designed to advance understanding and reasoning across diverse scientific domains while maintaining excellent performance on general vision tasks. Contrary to the trend of relying on massive domain-specific pretraining and opaque pipelines, our work demonstrates that principled training design and transparent methodology can yield strong scientific intelligence with substantially reduced data requirements. (i) First, we provide a fully transparent, end-to-end reproducible training pipeline, covering data collection, cleaning, preprocessing, supervised fine-tuning, reinforcement learning, and evaluation, along with detailed optimization recipes. This facilitates systematic extension by the community. (ii) Second, Innovator-VL exhibits remarkable data efficiency, achieving competitive performance on various scientific tasks using fewer than five million curated samples without large-scale pretraining. These results highlight that effective reasoning can be achieved through principled data selection rather than indiscriminate scaling. (iii) Third, Innovator-VL demonstrates strong generalization, achieving competitive performance on general vision, multimodal reasoning, and scientific benchmarks. This indicates that scientific alignment can be integrated into a unified model without compromising general-purpose capabilities. Our practices suggest that efficient, reproducible, and high-performing scientific multimodal models can be built even without large-scale data, providing a practical foundation for future research.