Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch
December 2, 2025
Authors: Yifan Zhang, Liang Hu, Haofeng Sun, Peiyu Wang, Yichen Wei, Shukang Yin, Jiangbo Pei, Wei Shen, Peng Xia, Yi Peng, Tianyidan Xie, Eric Li, Yang Liu, Xuchen Song, Yahui Zhou
cs.AI
Abstract
Despite recent progress in multimodal agentic systems, existing approaches often treat image manipulation and web search as disjoint capabilities, rely heavily on costly reinforcement learning, and lack planning grounded in real tool-execution traces. To address these limitations, we present Skywork-R1V4, a 30B (A3B) parameter multimodal agentic model that unifies multimodal planning, active image manipulation ("thinking with images"), deep multimodal search, and, most critically, interleaved reasoning that dynamically alternates between visual operations and external knowledge retrieval. Trained solely via supervised fine-tuning on fewer than 30,000 high-quality, planning-execution-consistent trajectories and validated through stepwise consistency filtering, Skywork-R1V4 achieves state-of-the-art results across perception and multimodal search benchmarks: it scores 66.1 on MMSearch and 67.2 on FVQA, surpassing Gemini 2.5 Flash on all 11 metrics. Skywork-R1V4 exhibits emergent long-horizon reasoning at inference time, successfully orchestrating more than 10 tool calls to solve complex, multi-step tasks. Our results demonstrate that sophisticated agentic multimodal intelligence can be achieved through carefully curated supervised learning alone, without any reliance on reinforcement learning.