Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning for Web Agents

Abstract

Multimodal large-scale models have significantly advanced the development of web agents, enabling them to perceive and interact with digital environments in a manner akin to human cognition. In this paper, we argue that web agents must first acquire sufficient knowledge before they can engage effectively in cognitive reasoning. We therefore decompose a web agent's capabilities into two essential stages: knowledge content learning and cognitive processes. To formalize this, we propose the Web-CogKnowledge Framework, which categorizes knowledge as Factual, Conceptual, or Procedural. In this framework, knowledge content learning corresponds to the agent's processes of Memorizing and Understanding, which rely on the first two knowledge types and represent the "what" of learning. Conversely, cognitive processes correspond to Exploring, which is grounded in Procedural knowledge and defines the "how" of reasoning and action. To facilitate knowledge acquisition, we construct the Web-CogDataset, a structured resource curated from 14 real-world websites and designed to systematically instill the core knowledge a web agent needs. This dataset serves as the agent's conceptual grounding (the "nouns" upon which comprehension is built) as well as the basis for learning how to reason and act. Building on this foundation, we operationalize these processes through a novel knowledge-driven Chain-of-Thought (CoT) reasoning framework, and we develop and train our proposed agent, Web-CogReasoner. Extensive experiments show that it significantly outperforms existing models, especially in generalizing to unseen tasks where structured knowledge is decisive. To enable rigorous evaluation, we introduce Web-CogBench, a comprehensive evaluation suite designed to assess and compare agent performance across the delineated knowledge domains and cognitive capabilities. Our code and data are open-sourced at https://github.com/Gnonymous/Web-CogReasoner.
