
Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning

November 25, 2025
Authors: Jiaqi Liu, Kaiwen Xiong, Peng Xia, Yiyang Zhou, Haonian Ji, Lu Feng, Siwei Han, Mingyu Ding, Huaxiu Yao
cs.AI

Abstract

Vision-language agents have achieved remarkable progress in a variety of multimodal reasoning tasks; however, their learning remains constrained by the limitations of human-annotated supervision. Recent self-rewarding approaches attempt to overcome this constraint by allowing models to act as their own critics or reward providers. Yet purely text-based self-evaluation struggles to verify complex visual reasoning steps and often suffers from evaluation hallucinations. To address these challenges, and inspired by recent advances in tool-integrated reasoning, we propose Agent0-VL, a self-evolving vision-language agent that achieves continual improvement through tool-integrated reasoning. Agent0-VL incorporates tool usage not only into reasoning but also into self-evaluation and self-repair, enabling the model to introspect, verify, and refine its reasoning through evidence-grounded analysis. It unifies two synergistic roles within a single LVLM: a Solver that performs multi-turn tool-integrated reasoning, and a Verifier that generates structured feedback and fine-grained self-rewards through tool-grounded critique. These roles interact through a Self-Evolving Reasoning Cycle, in which tool-based verification and reinforcement learning jointly align the reasoning and evaluation distributions for stable self-improvement. Through this zero-external-reward evolution, Agent0-VL aligns its reasoning and verification behaviors without any human annotation or external reward models, achieving continual self-improvement. Experiments on geometric problem solving and visual scientific analysis show that Agent0-VL achieves a 12.5% improvement over the base model. Our code is available at https://github.com/aiming-lab/Agent0/Agent0-VL.
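To make the Solver/Verifier interaction described in the abstract concrete, the sketch below mocks up one pass of such a self-evolution cycle in Python. It is a minimal illustration under stated assumptions, not the paper's actual implementation: a single policy callable plays both roles, a stub `run_tool` stands in for the sandboxed tool environment, and every name (`Step`, `Trace`, `solver`, `verifier`, `self_evolve`) is hypothetical; the authors' real code is in the linked repository.

```python
# Minimal, illustrative sketch of a Solver/Verifier self-evolution loop.
# All names below are hypothetical and are NOT the Agent0-VL API.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    thought: str       # natural-language reasoning for this turn
    tool_call: str     # e.g. a code snippet or geometry-tool query
    tool_output: str   # result returned by executing the tool call


@dataclass
class Trace:
    steps: List[Step]
    answer: str


def run_tool(code: str) -> str:
    """Stand-in for a sandboxed tool executor (code interpreter, image cropper, ...)."""
    return f"<output of {code!r}>"


def solver(policy: Callable[[str], str], question: str, max_turns: int = 4) -> Trace:
    """Multi-turn tool-integrated reasoning: think, call a tool, read the result, repeat."""
    steps, context = [], question
    for _ in range(max_turns):
        thought = policy("SOLVE:" + context)
        tool_call = policy("TOOL:" + thought)
        tool_output = run_tool(tool_call)
        steps.append(Step(thought, tool_call, tool_output))
        context += f"\n{thought}\n{tool_output}"
    return Trace(steps, answer=policy("ANSWER:" + context))


def verifier(policy: Callable[[str], str], trace: Trace) -> float:
    """Tool-grounded critique: re-check each step with tools and emit a scalar self-reward."""
    score = 0.0
    for step in trace.steps:
        evidence = run_tool(step.tool_call)                 # re-execute for evidence
        critique = policy(f"VERIFY:{step.thought}|{evidence}")
        score += 1.0 if "consistent" in critique else 0.0
    return score / max(len(trace.steps), 1)                 # fine-grained reward in [0, 1]


def self_evolve(policy, questions, update) -> None:
    """One cycle: the same model plays Solver and Verifier; its own reward drives the update."""
    for q in questions:
        trace = solver(policy, q)
        reward = verifier(policy, trace)                    # no external reward or labels
        update(policy, trace, reward)                       # e.g. an RL policy-gradient step


if __name__ == "__main__":
    dummy_policy = lambda p: "consistent step" if p.startswith("VERIFY") else "x = 1"
    self_evolve(dummy_policy, ["What is the area of the shaded triangle?"], lambda *a: None)
```

In this reading, the "zero-external-reward" claim corresponds to `update` being driven solely by the Verifier's score rather than by human labels or a separate reward model; how the reward is actually shaped and optimized is specified in the paper, not here.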