AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation
April 19, 2024
Authors: Wenhao Huang, Chenghao Peng, Zhixu Li, Jiaqing Liang, Yanghua Xiao, Liqian Wen, Zulong Chen
cs.AI
Abstract
Web automation is a significant technique that accomplishes complicated web
tasks by automating common web actions, enhancing operational efficiency, and
reducing the need for manual intervention. Traditional methods, such as
wrappers, suffer from limited adaptability and scalability when faced with a
new website. On the other hand, generative agents empowered by large language
models (LLMs) exhibit poor performance and reusability in open-world scenarios.
In this work, we introduce a crawler generation task for vertical information
web pages and the paradigm of combining LLMs with crawlers, which helps
crawlers handle diverse and changing web environments more efficiently. We
propose AutoCrawler, a two-stage framework that leverages the hierarchical
structure of HTML for progressive understanding. Through top-down and step-back
operations, AutoCrawler can learn from erroneous actions and continuously prune
HTML for better action generation. We conduct comprehensive experiments with
multiple LLMs and demonstrate the effectiveness of our framework. Resources of
this paper can be found at https://github.com/EZ-hwh/AutoCrawler.