Thinking with Images via Self-Calling Agent
December 9, 2025
Authors: Wenxi Yang, Yuzhong Zhao, Fang Wan, Qixiang Ye
cs.AI
Abstract
Thinking-with-images paradigms have showcased remarkable visual reasoning capability by integrating visual information as dynamic elements into the Chain-of-Thought (CoT). However, optimizing interleaved multimodal CoT (iMCoT) through reinforcement learning remains challenging, as it relies on scarce high-quality reasoning data. In this study, we propose Self-Calling Chain-of-Thought (sCoT), a novel visual reasoning paradigm that reformulates iMCoT as a language-only CoT with self-calling. Specifically, a main agent decomposes a complex visual reasoning task into atomic subtasks and invokes its virtual replicas, i.e., parameter-sharing subagents, to solve them in isolated contexts. Because sCoT requires no explicit interleaving between modalities, it offers substantial gains in training effectiveness and efficiency. sCoT further employs group-relative policy optimization to reinforce effective reasoning behaviors. Experiments on HR-Bench 4K show that sCoT improves overall reasoning performance by up to 1.9% while using ~75% fewer GPU hours than strong baseline approaches. Code is available at https://github.com/YWenxi/think-with-images-through-self-calling.
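
To make the self-calling control flow concrete, here is a minimal Python sketch of the decompose-then-delegate pattern the abstract describes: a main agent plans in language only, then routes each atomic sub-question to a "virtual replica" of itself, i.e., the same model queried in a fresh, isolated context. The `chat` helper, prompts, and message format below are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
# Minimal sketch of self-calling CoT. Assumption: `chat(messages) -> str` wraps
# one shared vision-language model; every call uses the same parameters but a
# fresh context, which is what makes the subagents "parameter-sharing replicas".

from typing import Callable, List

Messages = List[dict]
ChatFn = Callable[[Messages], str]


def self_calling_cot(chat: ChatFn, image: str, question: str) -> str:
    # 1. Main agent: language-only planning. Decompose the visual reasoning
    #    task into atomic sub-questions instead of interleaving modalities.
    plan = chat([
        {"role": "user", "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": (
                "Break this question into short, independent sub-questions, "
                f"one per line:\n{question}")},
        ]},
    ])
    subquestions = [line.strip() for line in plan.splitlines() if line.strip()]

    # 2. Subagents: each sub-question goes to the same model in an isolated
    #    context, so sub-solutions cannot interfere with one another.
    sub_answers = []
    for sub_q in subquestions:
        answer = chat([
            {"role": "user", "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": sub_q},
            ]},
        ])
        sub_answers.append(f"Q: {sub_q}\nA: {answer}")

    # 3. Main agent: aggregate the sub-results into a final answer, again as a
    #    purely textual chain of thought.
    return chat([
        {"role": "user", "content": [
            {"type": "text", "text": (
                "Using these sub-results, answer the original question.\n\n"
                + "\n\n".join(sub_answers)
                + f"\n\nOriginal question: {question}")},
        ]},
    ])
```

In this sketch the "replicas" are not separate models: every call reuses the same weights behind `chat`, and isolation comes solely from starting each sub-call with an empty conversation history.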