AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation

Abstract

Web automation is an important technique that accomplishes complex web tasks by automating common web actions, improving operational efficiency and reducing the need for manual intervention. Traditional methods, such as wrappers, suffer from limited adaptability and scalability when faced with a new website. Generative agents powered by large language models (LLMs), on the other hand, exhibit poor performance and reusability in open-world scenarios. In this work, we introduce a crawler generation task for vertical information web pages, together with a paradigm that combines LLMs with crawlers and helps crawlers handle diverse and changing web environments more efficiently. We propose AutoCrawler, a two-stage framework that leverages the hierarchical structure of HTML for progressive understanding. Through top-down and step-back operations, AutoCrawler learns from erroneous actions and continuously prunes the HTML for better action generation. We conduct comprehensive experiments with multiple LLMs and demonstrate the effectiveness of our framework. Resources for this paper are available at https://github.com/EZ-hwh/AutoCrawler
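The abstract only names the step-back operation, so the following is a minimal sketch of one plausible reading rather than the paper's implementation: when a selected node does not contain the expected value, climb the HTML hierarchy until a subtree does, then keep only that subtree as the pruned context. The page fragment, the class names, and the helpers `build_parent_map` and `step_back` are all hypothetical and were introduced here for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment; structure and class names are made up.
PAGE = """
<html><body>
  <div class="nav"><a>Home</a></div>
  <div class="product">
    <h1>Widget</h1>
    <div class="info"><span class="label">Price</span><span class="value">19.99</span></div>
  </div>
</body></html>
"""


def build_parent_map(root):
    """ElementTree has no parent pointers, so record each child's parent."""
    return {child: parent for parent in root.iter() for child in parent}


def step_back(node, parent_map, target):
    """Climb toward the root until the subtree's text contains `target`.

    Returns the first ancestor (or the node itself) whose text includes the
    target value, or None if even the root does not contain it.
    """
    current = node
    while current is not None:
        if target in "".join(current.itertext()):
            return current
        current = parent_map.get(current)
    return None


root = ET.fromstring(PAGE)
parents = build_parent_map(root)

# Assume a first, too-specific XPath selected the label instead of the value.
wrong_node = root.find(".//span[@class='label']")

# Step back to the smallest enclosing subtree that contains the expected value.
subtree = step_back(wrong_node, parents, "19.99")

# The pruned HTML keeps only that subtree, dropping unrelated parts of the page.
pruned_html = ET.tostring(subtree, encoding="unicode")
print(pruned_html)
```

In this sketch the pruned fragment is what would be handed back to the LLM for the next attempt; real pages are rarely well-formed XML, so an HTML-tolerant parser would be needed in practice.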
