AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation

April 19, 2024
Authors: Wenhao Huang, Chenghao Peng, Zhixu Li, Jiaqing Liang, Yanghua Xiao, Liqian Wen, Zulong Chen
cs.AI

Abstract

Web automation is a significant technique that accomplishes complicated web tasks by automating common web actions, enhancing operational efficiency, and reducing the need for manual intervention. Traditional methods, such as wrappers, suffer from limited adaptability and scalability when faced with a new website. On the other hand, generative agents empowered by large language models (LLMs) exhibit poor performance and reusability in open-world scenarios. In this work, we introduce a crawler generation task for vertical information web pages and the paradigm of combining LLMs with crawlers, which helps crawlers handle diverse and changing web environments more efficiently. We propose AutoCrawler, a two-stage framework that leverages the hierarchical structure of HTML for progressive understanding. Through top-down and step-back operations, AutoCrawler can learn from erroneous actions and continuously prune HTML for better action generation. We conduct comprehensive experiments with multiple LLMs and demonstrate the effectiveness of our framework. Resources of this paper can be found at https://github.com/EZ-hwh/AutoCrawler
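
As a rough illustration of the step-back idea mentioned in the abstract, the sketch below climbs from a wrongly chosen node to the smallest enclosing subtree that still contains the target value, then treats that subtree as the pruned HTML. It is a toy under stated assumptions, not the authors' implementation: the sample page, the `step_back` and `subtree_text` helpers, and the hard-coded wrong first guess (standing in for an LLM-proposed action) are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (real HTML would need a lenient parser).
PAGE = """
<html><body>
  <div id="nav"><span>Home</span></div>
  <div id="main">
    <div class="card"><h1>Widget</h1><span class="price">$9.99</span></div>
  </div>
</body></html>
""".strip()


def subtree_text(node):
    """Return all visible text under `node`, joined by single spaces."""
    return " ".join(t.strip() for t in node.itertext() if t.strip())


def step_back(root, node, contains_target):
    """Climb from `node` toward the root until the subtree satisfies the check.

    Returns the first ancestor-or-self whose subtree passes `contains_target`,
    or None if even the root fails.
    """
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    while node is not None and not contains_target(node):
        node = parents.get(node)
    return node


root = ET.fromstring(PAGE)

# Assumed wrong first action: the title node was selected instead of the price.
wrong_guess = root.find(".//h1")
pruned = step_back(root, wrong_guess, lambda n: "$9.99" in subtree_text(n))

print(pruned.tag, pruned.attrib)
```

In this toy case the climb stops at the `card` element, so a second attempt would only need to look at that much smaller fragment rather than the whole page.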
