

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

August 25, 2025
Authors: Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, Zhaokai Wang, Zhe Chen, Hongjie Zhang, Ganlin Yang, Haomin Wang, Qi Wei, Jinhui Yin, Wenhao Li, Erfei Cui, Guanzhou Chen, Zichen Ding, Changyao Tian, Zhenyu Wu, Jingjing Xie, Zehao Li, Bowen Yang, Yuchen Duan, Xuehui Wang, Songze Li, Xiangyu Zhao, Haodong Duan, Nianchen Deng, Bin Fu, Yinan He, Yi Wang, Conghui He, Botian Shi, Junjun He, Yingtong Xiong, Han Lv, Lijun Wu, Wenqi Shao, Kaipeng Zhang, Huipeng Deng, Biqing Qi, Jiaye Ge, Qipeng Guo, Wenwei Zhang, Wanli Ouyang, Limin Wang, Min Dou, Xizhou Zhu, Tong Lu, Dahua Lin, Jifeng Dai, Bowen Zhou, Weijie Su, Kai Chen, Yu Qiao, Wenhai Wang, Gen Luo
cs.AI

Abstract

We introduce InternVL 3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency in the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05× inference speedup compared to its predecessor, InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.
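The Visual Resolution Router idea described above can be illustrated with a minimal sketch: assign each visual patch a full or reduced token budget based on a per-patch importance score, so simple regions consume fewer tokens. All function names, budgets, and the threshold below are illustrative assumptions; in the actual system the router is a learned module, not a fixed rule.

```python
# Hypothetical sketch of resolution routing for visual tokens (not the
# paper's implementation). Each patch gets a token budget depending on a
# predicted importance score in [0, 1].

def route_visual_tokens(patch_scores, full_tokens=4, reduced_tokens=1, threshold=0.5):
    """Return the token budget assigned to each patch.

    patch_scores: importance scores, assumed to come from a learned
    predictor in the real system.
    """
    return [full_tokens if s >= threshold else reduced_tokens
            for s in patch_scores]

def total_tokens(patch_scores, **kwargs):
    # Total visual tokens after routing; compare against uniform
    # full-resolution encoding (len(patch_scores) * full_tokens).
    return sum(route_visual_tokens(patch_scores, **kwargs))

scores = [0.9, 0.2, 0.7, 0.1]
print(route_visual_tokens(scores))  # [4, 1, 4, 1]
print(total_tokens(scores))         # 10, versus 16 without routing
```

The speedup comes from the language model attending over fewer visual tokens; because the routing decision is per patch, detailed regions keep full resolution while uniform regions are compressed.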