Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge

Abstract

Agentic search, such as Deep Research systems in which large language models autonomously browse the web, synthesize information, and return comprehensive citation-backed answers, represents a major shift in how users interact with web-scale information. While it promises greater efficiency and cognitive offloading, the growing complexity and open-endedness of agentic search have outpaced existing evaluation benchmarks and methodologies, which largely assume short search horizons and static answers. In this paper, we introduce Mind2Web 2, a benchmark of 130 realistic, high-quality, long-horizon tasks that require real-time web browsing and extensive information synthesis, constructed with over 1,000 hours of human labor. To address the challenge of evaluating time-varying and complex answers, we propose a novel Agent-as-a-Judge framework. Our method constructs task-specific judge agents based on a tree-structured rubric design to automatically assess both answer correctness and source attribution. We conduct a comprehensive evaluation of nine frontier agentic search systems and human performance, along with a detailed error analysis to draw insights for future development. The best-performing system, OpenAI Deep Research, can already achieve 50-70% of human performance while spending half the time, showing great potential. Altogether, Mind2Web 2 provides a rigorous foundation for developing and benchmarking the next generation of agentic search systems.
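
To make the idea of a tree-structured rubric more concrete, here is a minimal illustrative sketch in Python. The abstract does not say how rubric nodes are scored or combined, so everything here is an assumption made for illustration: the `RubricNode` class, the binary leaf checks, the unweighted-mean aggregation, and the node names are all hypothetical and are not taken from the Mind2Web 2 implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    """One node of a hypothetical tree-structured rubric."""
    name: str
    passed: Optional[bool] = None  # only meaningful on leaf nodes
    children: List["RubricNode"] = field(default_factory=list)

    def score(self) -> float:
        # Leaf: a single binary check, e.g. "is this claim supported by the cited page?"
        if not self.children:
            return 1.0 if self.passed else 0.0
        # Internal node: ASSUMED aggregation is the unweighted mean of the child scores.
        return sum(child.score() for child in self.children) / len(self.children)


# Hypothetical rubric for one task, split into the two aspects the abstract names:
# answer correctness and source attribution.
rubric = RubricNode("task", children=[
    RubricNode("correctness", children=[
        RubricNode("item_1_correct", passed=True),
        RubricNode("item_2_correct", passed=False),
    ]),
    RubricNode("attribution", children=[
        RubricNode("item_1_cited", passed=True),
        RubricNode("item_2_cited", passed=True),
    ]),
])

print(rubric.score())  # 0.75
```

In this sketch the leaf verdicts are hard-coded; in an agent-based judge they would presumably come from model calls that inspect the answer and the cited pages, and the real framework may use a different aggregation rule, such as weighting nodes or treating some of them as mandatory.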
