SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents
December 26, 2025
Authors: Shaofei Cai, Yulei Qin, Haojia Lin, Zihan Xu, Gang Li, Yuchen Shi, Zongyi Li, Yong Mao, Siqi Cai, Xiaoyu Tan, Yitao Liang, Ke Li, Xing Sun
cs.AI
Abstract
Agentic reinforcement learning (RL) holds great promise for developing autonomous agents for complex GUI tasks, but its scalability remains severely hampered by the verification of task completion. Existing task verification is treated as a passive, post-hoc process: a verifier (e.g., a rule-based scoring script, a reward or critic model, or an LLM-as-a-Judge) analyzes the agent's entire interaction trajectory to determine whether the agent succeeded. Processing such verbose context, laden with irrelevant, noisy history, strains the verification protocols and leads to prohibitive cost and low reliability. To overcome this bottleneck, we propose SmartSnap, a paradigm shift from passive, post-hoc verification to proactive, in-situ self-verification by the agent itself. We introduce the Self-Verifying Agent, a new type of agent designed with a dual mission: not only to complete a task but also to prove its accomplishment with curated snapshot evidence. Guided by our proposed 3C Principles (Completeness, Conciseness, and Creativity), the agent leverages its access to the online environment to perform self-verification on a minimal, decisive set of snapshots. This evidence is provided as the sole material for a general LLM-as-a-Judge verifier to determine its validity and relevance. Experiments on mobile tasks across model families and scales demonstrate that the SmartSnap paradigm allows LLM-driven agents to be trained in a scalable manner, bringing performance gains of up to 26.08% and 16.66% to 8B and 30B models, respectively. The synergy between solution finding and evidence seeking cultivates efficient, self-verifying agents whose performance is competitive with DeepSeek V3.1 and Qwen3-235B-A22B.
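To make the described flow concrete, below is a minimal sketch of how proactive, in-situ evidence collection and judge-only-sees-snapshots verification could be wired together. All names here (`Snapshot`, `Episode`, `run_episode`, `verify`, and the `agent`, `env`, and `judge` interfaces) are hypothetical illustrations for this summary, not the authors' released API; the 3C curation itself is assumed to happen inside the agent's policy.

```python
# Minimal sketch of a SmartSnap-style rollout and verification loop.
# Assumption: the agent decides at each step whether the current screen is
# decisive evidence (3C: complete, concise, creative), and the judge scores
# only the curated snapshots, never the full trajectory.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Snapshot:
    """One screenshot the agent keeps as candidate proof of task completion."""
    step: int
    image_path: str
    rationale: str  # the agent's stated reason this frame is decisive

@dataclass
class Episode:
    actions: List[str] = field(default_factory=list)
    evidence: List[Snapshot] = field(default_factory=list)  # minimal, curated set

def run_episode(agent, env, task: str, max_steps: int = 30) -> Episode:
    """Roll out the agent: it both acts and proactively curates snapshot evidence."""
    episode = Episode()
    obs = env.reset(task)
    for step in range(max_steps):
        # The policy returns an action plus a self-verification decision.
        action, take_snapshot, rationale = agent.act(task, obs)
        episode.actions.append(action)
        if take_snapshot:
            episode.evidence.append(Snapshot(step, env.screenshot(), rationale))
        obs, done = env.step(action)
        if done:
            break
    return episode

def verify(judge, task: str, episode: Episode) -> float:
    """The LLM-as-a-Judge verifier sees only the curated snapshots as material."""
    materials: List[Tuple[str, str]] = [
        (s.image_path, s.rationale) for s in episode.evidence
    ]
    # e.g., returns 1.0 if the evidence proves the task was accomplished, else 0.0
    return judge.score(task=task, evidence=materials)
```

In an RL training loop under this sketch, the score returned by `verify` would serve as the trajectory-level reward, so the same policy is optimized jointly for solution finding and evidence seeking.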