
Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning

November 25, 2025
Authors: Jiaqi Liu, Kaiwen Xiong, Peng Xia, Yiyang Zhou, Haonian Ji, Lu Feng, Siwei Han, Mingyu Ding, Huaxiu Yao
cs.AI

Abstract

Vision-language agents have achieved remarkable progress in a variety of multimodal reasoning tasks; however, their learning remains constrained by the limitations of human-annotated supervision. Recent self-rewarding approaches attempt to overcome this constraint by allowing models to act as their own critics or reward providers. Yet purely text-based self-evaluation struggles to verify complex visual reasoning steps and often suffers from evaluation hallucinations. To address these challenges, inspired by recent advances in tool-integrated reasoning, we propose Agent0-VL, a self-evolving vision-language agent that achieves continual improvement through tool-integrated reasoning. Agent0-VL incorporates tool usage not only into reasoning but also into self-evaluation and self-repair, enabling the model to introspect, verify, and refine its reasoning through evidence-grounded analysis. It unifies two synergistic roles within a single LVLM: a Solver that performs multi-turn tool-integrated reasoning, and a Verifier that generates structured feedback and fine-grained self-rewards through tool-grounded critique. These roles interact through a Self-Evolving Reasoning Cycle, in which tool-based verification and reinforcement learning jointly align the reasoning and evaluation distributions for stable self-improvement. Through this zero-external-reward evolution, Agent0-VL aligns its reasoning and verification behaviors without any human annotation or external reward models, achieving continual self-improvement. Experiments on geometric problem solving and visual scientific analysis show that Agent0-VL achieves a 12.5% improvement over the base model. Our code is available at https://github.com/aiming-lab/Agent0/Agent0-VL.
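
To make the Solver/Verifier interaction concrete, below is a minimal Python sketch of one pass of a self-evolving cycle in the spirit described above. All names here (call_lvlm, run_tool, solve, verify, self_evolve) are illustrative assumptions rather than the authors' actual API, and the tool execution, scoring rule, and reinforcement-learning update are stubbed out; the paper's implementation should be consulted at the linked repository.

```python
"""Conceptual sketch (not the authors' code) of a Solver/Verifier self-evolution loop.
The solver reasons over multiple turns with tool calls; the verifier re-checks the
trace with the same tools and emits a scalar self-reward used as the training signal."""

from dataclasses import dataclass


@dataclass
class Turn:
    thought: str                 # intermediate reasoning text
    tool_call: str | None        # e.g. a sandboxed Python / geometry query
    tool_result: str | None      # evidence returned by the tool


def call_lvlm(role: str, prompt: str, image: bytes | None = None) -> str:
    """Placeholder for querying the shared LVLM in either role ('solver' or 'verifier')."""
    return f"[{role} output for: {prompt[:40]}...]"


def run_tool(code: str) -> str:
    """Placeholder tool executor that grounds reasoning and critique in concrete evidence."""
    return f"[tool result for: {code[:40]}...]"


def solve(question: str, image: bytes | None, max_turns: int = 4) -> list[Turn]:
    """Solver role: multi-turn, tool-integrated reasoning over the same question."""
    trace: list[Turn] = []
    for _ in range(max_turns):
        thought = call_lvlm("solver", question + "\n" + repr(trace), image)
        tool_call = thought if "compute" in thought else None   # toy routing rule
        tool_result = run_tool(tool_call) if tool_call else None
        trace.append(Turn(thought, tool_call, tool_result))
    return trace


def verify(question: str, trace: list[Turn]) -> float:
    """Verifier role: tool-grounded critique yielding a self-reward in [0, 1],
    replacing an external reward model."""
    critique = call_lvlm("verifier", f"Check this trace for '{question}': {trace}")
    evidence = run_tool(critique)                # re-check key steps with the tool
    return 1.0 if "ok" in evidence.lower() else 0.3   # toy scoring rule


def self_evolve(questions: list[str], images: list[bytes | None]) -> None:
    """One cycle: solve, self-verify, and use the self-reward as the learning signal.
    A real system would feed the reward into an RL update (e.g. policy gradient)
    that jointly aligns the solver and verifier; here it is only printed."""
    for q, img in zip(questions, images):
        trace = solve(q, img)
        reward = verify(q, trace)
        print(f"{q[:40]!r}: self-reward = {reward:.2f}")


if __name__ == "__main__":
    self_evolve(["What is the area of the shaded triangle?"], [None])
```

The key design point the sketch tries to reflect is that both roles are served by the same model and both have access to tools, so the verifier's reward is anchored in executable evidence rather than free-form text judgment.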