Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning for Web Agents
August 3, 2025
Authors: Yuhan Guo, Cong Guo, Aiwen Sun, Hongliang He, Xinyu Yang, Yue Lu, Yingji Zhang, Xuntao Guo, Dong Zhang, Jianzhuang Liu, Jiang Duan, Yijia Xiao, Liangjian Wen, Hai-Ming Xu, Yong Dai
cs.AI
Abstract
Multimodal large-scale models have significantly advanced the development of
web agents, enabling perception and interaction with digital environments akin
to human cognition. In this paper, we argue that web agents must first acquire
sufficient knowledge to effectively engage in cognitive reasoning. Therefore,
we decompose a web agent's capabilities into two essential stages: knowledge
content learning and cognitive processes. To formalize this, we propose the
Web-CogKnowledge Framework, which categorizes knowledge as Factual, Conceptual, and
Procedural. In this framework, knowledge content learning corresponds to the
agent's processes of Memorizing and Understanding, which rely on the first two
knowledge types, representing the "what" of learning. Conversely, cognitive
processes correspond to Exploring, grounded in Procedural knowledge, defining
the "how" of reasoning and action. To facilitate knowledge acquisition, we
construct the Web-CogDataset, a structured resource curated from 14 real-world
websites, designed to systematically instill the core knowledge necessary for web
agents. This dataset serves as the agent's conceptual grounding (the "nouns" upon
which comprehension is built) as well as the basis for learning how to reason
and act. Building on this foundation, we operationalize these processes through
a novel knowledge-driven Chain-of-Thought (CoT) reasoning framework, developing
and training our proposed agent, Web-CogReasoner. Extensive experiments show
that it significantly outperforms existing models, especially in
generalizing to unseen tasks where structured knowledge is decisive. To enable
rigorous evaluation, we introduce the Web-CogBench, a comprehensive evaluation
suite designed to assess and compare agent performance across the delineated
knowledge domains and cognitive capabilities. Our code and data are open-sourced
at https://github.com/Gnonymous/Web-CogReasoner.
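
The two-stage decomposition described in the abstract can be pictured as a small lookup structure. The sketch below is only an illustration of that mapping, not code from the Web-CogReasoner repository: the class names, the `Stage` record, and the helpers `stage_for` and `knowledge_for` are all hypothetical, and the abstract does not say whether Memorizing and Understanding each pair with a single knowledge type, so both are tied to the Factual and Conceptual types jointly.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class KnowledgeType(Enum):
    """The three knowledge categories named in the abstract."""
    FACTUAL = "factual"
    CONCEPTUAL = "conceptual"
    PROCEDURAL = "procedural"


class CognitiveProcess(Enum):
    """The three agent processes named in the abstract."""
    MEMORIZING = "memorizing"
    UNDERSTANDING = "understanding"
    EXPLORING = "exploring"


@dataclass(frozen=True)
class Stage:
    """Hypothetical record tying a stage to its processes and knowledge types."""
    name: str
    processes: Tuple[CognitiveProcess, ...]
    knowledge: Tuple[KnowledgeType, ...]
    role: str  # the "what" of learning or the "how" of reasoning and action


# Stage 1: knowledge content learning (Memorizing and Understanding),
# relying on Factual and Conceptual knowledge.
KNOWLEDGE_CONTENT_LEARNING = Stage(
    name="knowledge content learning",
    processes=(CognitiveProcess.MEMORIZING, CognitiveProcess.UNDERSTANDING),
    knowledge=(KnowledgeType.FACTUAL, KnowledgeType.CONCEPTUAL),
    role="what",
)

# Stage 2: cognitive processes (Exploring), grounded in Procedural knowledge.
COGNITIVE_PROCESSES = Stage(
    name="cognitive processes",
    processes=(CognitiveProcess.EXPLORING,),
    knowledge=(KnowledgeType.PROCEDURAL,),
    role="how",
)

STAGES = (KNOWLEDGE_CONTENT_LEARNING, COGNITIVE_PROCESSES)


def stage_for(process: CognitiveProcess) -> Stage:
    """Return the stage that a given process belongs to."""
    for stage in STAGES:
        if process in stage.processes:
            return stage
    raise ValueError(f"unknown process: {process}")


def knowledge_for(process: CognitiveProcess) -> Tuple[KnowledgeType, ...]:
    """Return the knowledge types that a given process draws on."""
    return stage_for(process).knowledge
```

For example, `knowledge_for(CognitiveProcess.EXPLORING)` returns only the Procedural type, while both Memorizing and Understanding resolve to the Factual and Conceptual pair.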