How to Train Your LLM Web Agent: A Statistical Diagnosis

Abstract

LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute costs required to post-train LLM-based web agents. To address this, we present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++. Further, this strategy requires only 55% of the compute to match the peak performance of pure SFT on MiniWob++, effectively pushing the compute-performance Pareto frontier, and is the only strategy that can close the gap with closed-source models.
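The abstract says bootstrapping is used to estimate effective hyperparameters from sampled configurations, but does not spell out the procedure. The following is a minimal sketch of one way such an estimate could work, not the authors' implementation: resample the finished runs with replacement, pick the best run in each resample, and count how often each hyperparameter value belongs to the winner. The function name `bootstrap_best_value`, the run schema (`lr`, `score`), and the toy numbers are all hypothetical.

```python
import random
from collections import Counter


def bootstrap_best_value(runs, param, n_boot=1000, seed=0):
    """Estimate how often each value of `param` belongs to the best run.

    runs: list of dicts such as {"lr": 1e-5, "score": 0.42} (hypothetical schema).
    Returns a dict mapping each winning value to the fraction of resamples it won.
    """
    rng = random.Random(seed)
    wins = Counter()
    for _ in range(n_boot):
        # Resample the runs with replacement, keeping the original sample size.
        sample = rng.choices(runs, k=len(runs))
        # The winner of this resample is its highest-scoring run.
        best = max(sample, key=lambda r: r["score"])
        wins[best[param]] += 1
    return {value: count / n_boot for value, count in wins.items()}


# Toy, made-up results for illustration only (not data from the paper).
runs = [
    {"lr": 1e-6, "score": 0.31},
    {"lr": 1e-6, "score": 0.35},
    {"lr": 1e-5, "score": 0.48},
    {"lr": 1e-5, "score": 0.52},
    {"lr": 1e-4, "score": 0.40},
    {"lr": 1e-4, "score": 0.22},
]
win_rates = bootstrap_best_value(runs, "lr")
print(win_rates)
```

A value that wins in most resamples is a more trustworthy choice than one that simply produced the single best run, since a lone lucky run rarely survives resampling.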
