Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch

December 2, 2025
Authors: Yifan Zhang, Liang Hu, Haofeng Sun, Peiyu Wang, Yichen Wei, Shukang Yin, Jiangbo Pei, Wei Shen, Peng Xia, Yi Peng, Tianyidan Xie, Eric Li, Yang Liu, Xuchen Song, Yahui Zhou
cs.AI

Abstract

Despite recent progress in multimodal agentic systems, existing approaches often treat image manipulation and web search as disjoint capabilities, rely heavily on costly reinforcement learning, and lack planning grounded in real tool-execution traces. To address these limitations, we present Skywork-R1V4, a 30B (A3B) parameter multimodal agentic model that unifies multimodal planning, active image manipulation ("thinking with images"), deep multimodal search, and, most critically, interleaved reasoning that dynamically alternates between visual operations and external knowledge retrieval. Trained solely via supervised fine-tuning on fewer than 30,000 high-quality, planning-execution-consistent trajectories and validated through stepwise consistency filtering, Skywork-R1V4 achieves state-of-the-art results across perception and multimodal search benchmarks: it scores 66.1 on MMSearch and 67.2 on FVQA, surpassing Gemini 2.5 Flash on all 11 metrics. Skywork-R1V4 exhibits emergent long-horizon reasoning at inference time, successfully orchestrating more than 10 tool calls to solve complex, multi-step tasks. Our results demonstrate that sophisticated agentic multimodal intelligence can be achieved through carefully curated supervised learning alone, without any reliance on reinforcement learning.
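
The interleaved reasoning described in the abstract, where the model alternates between visual operations and external retrieval until it can answer, can be pictured as a simple tool-calling loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: `Step`, `run_agent`, the tool names `crop` and `search`, and the scripted policy are all hypothetical stand-ins, and the stub tools return placeholder strings rather than touching real images or the web.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# All names here (Step, run_agent, "crop", "search") are hypothetical
# illustrations; none of them come from the Skywork-R1V4 release.


@dataclass
class Step:
    tool: str              # "crop", "search", or "answer"
    argument: str          # tool input, or the final answer text
    observation: str = ""  # tool output fed back to the policy


def run_agent(
    policy: Callable[[List[Step]], Tuple[str, str]],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 16,
) -> List[Step]:
    """Alternate between tool calls until the policy emits an answer."""
    trajectory: List[Step] = []
    for _ in range(max_steps):
        tool, argument = policy(trajectory)
        step = Step(tool, argument)
        if tool == "answer":
            trajectory.append(step)
            break
        # Execute the chosen tool and record its result for the next decision.
        step.observation = tools[tool](argument)
        trajectory.append(step)
    return trajectory


# Stub tools standing in for real image manipulation and web search.
def crop(region: str) -> str:
    return f"cropped image region {region}"


def search(query: str) -> str:
    return f"top result for '{query}'"


# A scripted stand-in for the model: crop, then search, then answer.
def scripted_policy(history: List[Step]) -> Tuple[str, str]:
    script = [
        ("crop", "top-left"),
        ("search", "landmark in crop"),
        ("answer", "done"),
    ]
    return script[len(history)]


trajectory = run_agent(scripted_policy, {"crop": crop, "search": search})
print([s.tool for s in trajectory])
```

In a real system the policy would be the model itself, choosing the next tool from the accumulated observations; the `max_steps` cap is an assumed safeguard for long trajectories such as the 10+ tool calls mentioned above.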