

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

August 25, 2025
Authors: Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, Zhaokai Wang, Zhe Chen, Hongjie Zhang, Ganlin Yang, Haomin Wang, Qi Wei, Jinhui Yin, Wenhao Li, Erfei Cui, Guanzhou Chen, Zichen Ding, Changyao Tian, Zhenyu Wu, Jingjing Xie, Zehao Li, Bowen Yang, Yuchen Duan, Xuehui Wang, Songze Li, Xiangyu Zhao, Haodong Duan, Nianchen Deng, Bin Fu, Yinan He, Yi Wang, Conghui He, Botian Shi, Junjun He, Yingtong Xiong, Han Lv, Lijun Wu, Wenqi Shao, Kaipeng Zhang, Huipeng Deng, Biqing Qi, Jiaye Ge, Qipeng Guo, Wenwei Zhang, Wanli Ouyang, Limin Wang, Min Dou, Xizhou Zhu, Tong Lu, Dahua Lin, Jifeng Dai, Bowen Zhou, Weijie Su, Kai Chen, Yu Qiao, Wenhai Wang, Gen Luo
cs.AI

Abstract

We introduce InternVL 3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency in the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05× inference speedup compared to its predecessor, InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.
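The Visual Resolution Router idea described above can be illustrated with a minimal sketch: assign each visual patch a full or reduced token budget based on a per-patch importance score, so simple regions consume fewer tokens. All function names, budgets, and the threshold below are illustrative assumptions; in the actual system the router is a learned module, not a fixed rule.

```python
# Hypothetical sketch of resolution routing for visual tokens (not the
# paper's implementation). Each patch gets a token budget depending on a
# predicted importance score in [0, 1].

def route_visual_tokens(patch_scores, full_tokens=4, reduced_tokens=1, threshold=0.5):
    """Return the token budget assigned to each patch.

    patch_scores: importance scores, assumed to come from a learned
    predictor in the real system.
    """
    return [full_tokens if s >= threshold else reduced_tokens
            for s in patch_scores]

def total_tokens(patch_scores, **kwargs):
    # Total visual tokens after routing; compare against uniform
    # full-resolution encoding (len(patch_scores) * full_tokens).
    return sum(route_visual_tokens(patch_scores, **kwargs))

scores = [0.9, 0.2, 0.7, 0.1]
print(route_visual_tokens(scores))  # [4, 1, 4, 1]
print(total_tokens(scores))         # 10, versus 16 without routing
```

The speedup comes from the language model attending over fewer visual tokens; because the routing decision is per patch, detailed regions keep full resolution while uniform regions are compressed.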