WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents

Abstract

Web agent research still struggles to achieve both generalization and accuracy. Because website structure varies widely, existing approaches often fail, and existing fine-tuning and in-context learning techniques do not generalize across multiple websites. We introduce Wilbur, an approach that uses a differentiable ranking model and a novel instruction synthesis technique to optimally populate a black-box large language model's prompt with task demonstrations from previous runs. To maximize end-to-end success rates, we also propose an intelligent backtracking mechanism that learns and recovers from its mistakes. Finally, we show that our ranking model can be trained on data from a generative auto-curriculum that samples representative goals from an LLM, runs the agent, and automatically evaluates it, with no manual annotation. Wilbur achieves state-of-the-art results on the WebVoyager benchmark, beating text-only models by 8% overall and by up to 36% on certain websites. On the same benchmark, Wilbur is within 5% of a strong multi-modal model despite receiving only textual inputs, and further analysis reveals that a substantial number of failures are due to engineering challenges of operating the web.
