InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

August 25, 2025
作者: Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, Zhaokai Wang, Zhe Chen, Hongjie Zhang, Ganlin Yang, Haomin Wang, Qi Wei, Jinhui Yin, Wenhao Li, Erfei Cui, Guanzhou Chen, Zichen Ding, Changyao Tian, Zhenyu Wu, Jingjing Xie, Zehao Li, Bowen Yang, Yuchen Duan, Xuehui Wang, Songze Li, Xiangyu Zhao, Haodong Duan, Nianchen Deng, Bin Fu, Yinan He, Yi Wang, Conghui He, Botian Shi, Junjun He, Yingtong Xiong, Han Lv, Lijun Wu, Wenqi Shao, Kaipeng Zhang, Huipeng Deng, Biqing Qi, Jiaye Ge, Qipeng Guo, Wenwei Zhang, Wanli Ouyang, Limin Wang, Min Dou, Xizhou Zhu, Tong Lu, Dahua Lin, Jifeng Dai, Bowen Zhou, Weijie Su, Kai Chen, Yu Qiao, Wenhai Wang, Gen Luo
cs.AI

Abstract

We introduce InternVL 3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05× inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.
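The abstract describes Decoupled Vision-Language Deployment (DvD) only at a high level. Below is a minimal, hypothetical sketch of the core idea: a stand-in vision encoder and a stand-in language model are placed on different GPUs, so only the visual tokens cross devices. All class names, dimensions, and shapes here are illustrative assumptions, not details taken from the paper or its released code.

```python
# Minimal sketch (not the authors' implementation) of the DvD idea: the vision
# encoder and the language model live on separate devices, and only the visual
# tokens are transferred between them.

import torch
import torch.nn as nn

VISION_DEVICE = "cuda:0" if torch.cuda.device_count() > 0 else "cpu"
LLM_DEVICE = "cuda:1" if torch.cuda.device_count() > 1 else VISION_DEVICE


class ToyVisionEncoder(nn.Module):
    """Stand-in for a ViT: maps an image to a sequence of visual tokens."""

    def __init__(self, num_tokens: int = 256, dim: int = 1024):
        super().__init__()
        self.num_tokens = num_tokens
        self.dim = dim
        self.proj = nn.Linear(3 * 32 * 32, num_tokens * dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (batch, 3, 32, 32) -> visual tokens: (batch, num_tokens, dim)
        flat = image.flatten(1)
        return self.proj(flat).view(image.size(0), self.num_tokens, self.dim)


class ToyLanguageModel(nn.Module):
    """Stand-in for the LLM: consumes visual tokens plus text embeddings."""

    def __init__(self, dim: int = 1024, vocab: int = 32000):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.head(self.block(tokens))


vision = ToyVisionEncoder().to(VISION_DEVICE).eval()
llm = ToyLanguageModel().to(LLM_DEVICE).eval()

with torch.no_grad():
    image = torch.randn(1, 3, 32, 32, device=VISION_DEVICE)
    visual_tokens = vision(image)                 # runs on the vision device
    visual_tokens = visual_tokens.to(LLM_DEVICE)  # only the tokens cross devices
    text_tokens = torch.randn(1, 16, 1024, device=LLM_DEVICE)
    logits = llm(torch.cat([visual_tokens, text_tokens], dim=1))
    print(logits.shape)  # (1, 256 + 16, 32000)
```

In an actual serving stack the two halves would typically run as separate processes so that vision encoding for the next request overlaps with autoregressive decoding for the current one; this sketch only shows the data handoff that makes such decoupling possible.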