InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

August 25, 2025
作者: Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, Zhaokai Wang, Zhe Chen, Hongjie Zhang, Ganlin Yang, Haomin Wang, Qi Wei, Jinhui Yin, Wenhao Li, Erfei Cui, Guanzhou Chen, Zichen Ding, Changyao Tian, Zhenyu Wu, Jingjing Xie, Zehao Li, Bowen Yang, Yuchen Duan, Xuehui Wang, Songze Li, Xiangyu Zhao, Haodong Duan, Nianchen Deng, Bin Fu, Yinan He, Yi Wang, Conghui He, Botian Shi, Junjun He, Yingtong Xiong, Han Lv, Lijun Wu, Wenqi Shao, Kaipeng Zhang, Huipeng Deng, Biqing Qi, Jiaye Ge, Qipeng Guo, Wenwei Zhang, Wanli Ouyang, Limin Wang, Min Dou, Xizhou Zhu, Tong Lu, Dahua Lin, Jifeng Dai, Bowen Zhou, Weijie Su, Kai Chen, Yu Qiao, Wenhai Wang, Gen Luo
cs.AI

Abstract

We introduce InternVL 3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05× inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.
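The abstract describes Decoupled Vision-Language Deployment (DvD) only at a high level. Below is a minimal, hypothetical sketch of the core idea: a stand-in vision encoder and a stand-in language model are placed on different GPUs, so only the visual tokens cross devices. All class names, dimensions, and shapes here are illustrative assumptions, not details taken from the paper or its released code.

```python
# Minimal sketch (not the authors' implementation) of the DvD idea: the vision
# encoder and the language model live on separate devices, and only the visual
# tokens are transferred between them.

import torch
import torch.nn as nn

VISION_DEVICE = "cuda:0" if torch.cuda.device_count() > 0 else "cpu"
LLM_DEVICE = "cuda:1" if torch.cuda.device_count() > 1 else VISION_DEVICE


class ToyVisionEncoder(nn.Module):
    """Stand-in for a ViT: maps an image to a sequence of visual tokens."""

    def __init__(self, num_tokens: int = 256, dim: int = 1024):
        super().__init__()
        self.num_tokens = num_tokens
        self.dim = dim
        self.proj = nn.Linear(3 * 32 * 32, num_tokens * dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (batch, 3, 32, 32) -> visual tokens: (batch, num_tokens, dim)
        flat = image.flatten(1)
        return self.proj(flat).view(image.size(0), self.num_tokens, self.dim)


class ToyLanguageModel(nn.Module):
    """Stand-in for the LLM: consumes visual tokens plus text embeddings."""

    def __init__(self, dim: int = 1024, vocab: int = 32000):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.head(self.block(tokens))


vision = ToyVisionEncoder().to(VISION_DEVICE).eval()
llm = ToyLanguageModel().to(LLM_DEVICE).eval()

with torch.no_grad():
    image = torch.randn(1, 3, 32, 32, device=VISION_DEVICE)
    visual_tokens = vision(image)                 # runs on the vision device
    visual_tokens = visual_tokens.to(LLM_DEVICE)  # only the tokens cross devices
    text_tokens = torch.randn(1, 16, 1024, device=LLM_DEVICE)
    logits = llm(torch.cat([visual_tokens, text_tokens], dim=1))
    print(logits.shape)  # (1, 256 + 16, 32000)
```

In an actual serving stack the two halves would typically run as separate processes so that vision encoding for the next request overlaps with autoregressive decoding for the current one; this sketch only shows the data handoff that makes such decoupling possible.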