Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning for Web Agents

Abstract

Multimodal large-scale models have significantly advanced the development of web agents, enabling them to perceive and interact with digital environments in a manner akin to human cognition. In this paper, we argue that web agents must first acquire sufficient knowledge before they can engage effectively in cognitive reasoning. We therefore decompose a web agent's capabilities into two essential stages: knowledge content learning and cognitive processes. To formalize this, we propose the Web-CogKnowledge Framework, which categorizes knowledge as Factual, Conceptual, and Procedural. In this framework, knowledge content learning corresponds to the agent's processes of Memorizing and Understanding, which rely on the first two knowledge types and represent the "what" of learning. Conversely, cognitive processes correspond to Exploring, which is grounded in Procedural knowledge and defines the "how" of reasoning and action. To facilitate knowledge acquisition, we construct Web-CogDataset, a structured resource curated from 14 real-world websites and designed to systematically instill the core knowledge a web agent needs. This dataset serves as the agent's conceptual grounding (the "nouns" upon which comprehension is built) as well as the basis for learning how to reason and act. Building on this foundation, we operationalize these processes through a novel knowledge-driven Chain-of-Thought (CoT) reasoning framework, and we develop and train our proposed agent, Web-CogReasoner. Extensive experiments show that it significantly outperforms existing models, especially when generalizing to unseen tasks where structured knowledge is decisive. To enable rigorous evaluation, we introduce Web-CogBench, a comprehensive evaluation suite designed to assess and compare agent performance across the delineated knowledge domains and cognitive capabilities. Our code and data are open-sourced at https://github.com/Gnonymous/Web-CogReasoner
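
The two-stage decomposition described above can be written down as a small lookup structure. The following is only an illustrative sketch of the taxonomy as the abstract states it, not code from the Web-CogReasoner repository; the names `Knowledge`, `Process`, `Stage`, `STAGES`, and `stage_for` are hypothetical and introduced here for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Knowledge(Enum):
    """The three knowledge types named in the abstract."""
    FACTUAL = "Factual"
    CONCEPTUAL = "Conceptual"
    PROCEDURAL = "Procedural"


class Process(Enum):
    """The three agent processes named in the abstract."""
    MEMORIZING = "Memorizing"
    UNDERSTANDING = "Understanding"
    EXPLORING = "Exploring"


@dataclass(frozen=True)
class Stage:
    """One of the two stages; field names are assumptions of this sketch."""
    name: str
    role: str  # "what" or "how", following the abstract's wording
    processes: frozenset
    knowledge: frozenset


# The abstract ties Memorizing and Understanding jointly to the first two
# knowledge types, so this sketch does not split them any further.
STAGES = (
    Stage(
        name="knowledge content learning",
        role="what",
        processes=frozenset({Process.MEMORIZING, Process.UNDERSTANDING}),
        knowledge=frozenset({Knowledge.FACTUAL, Knowledge.CONCEPTUAL}),
    ),
    Stage(
        name="cognitive processes",
        role="how",
        processes=frozenset({Process.EXPLORING}),
        knowledge=frozenset({Knowledge.PROCEDURAL}),
    ),
)


def stage_for(knowledge: Knowledge) -> Stage:
    """Return the stage that relies on the given knowledge type."""
    for stage in STAGES:
        if knowledge in stage.knowledge:
            return stage
    raise ValueError(f"no stage uses {knowledge!r}")
```

For example, `stage_for(Knowledge.PROCEDURAL)` returns the "cognitive processes" stage, whose only process is Exploring, while Factual and Conceptual knowledge both map to "knowledge content learning".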
