Learning to Learn-at-Test-Time: Language Agents with Learnable Adaptation Policies

Abstract

Test-Time Learning (TTL) enables language agents to iteratively refine their performance through repeated interactions with the environment at inference time. At the core of TTL is an adaptation policy that updates the actor policy based on experience from previous episodes, thereby improving future behavior. Existing methods rely on fixed, hand-crafted adaptation policies rather than optimizing them for downstream improvement. We argue that optimal adaptation policies should be learned from task environments, not hand-engineered based on human intuition. To achieve this, we introduce Meta-TTL, a framework that formulates the discovery of effective adaptation policies as a bi-level optimization problem. Within this framework, the inner loop executes the standard TTL process, measuring how effectively a candidate adaptation policy helps an agent correct errors across sequential episodes. Guided by the agent's performance, the outer loop employs evolutionary search over a diverse distribution of training tasks to iteratively refine the adaptation policy. We evaluate Meta-TTL on Jericho and WebArena-Lite across both in-distribution (ID) and out-of-distribution (OOD) settings, using multiple meta-agent backbones. Results on both benchmarks show that Meta-TTL consistently outperforms hand-crafted baselines, suggesting that the optimized adaptation policy encodes transferable strategies that generalize beyond the training task distribution.
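The bi-level structure described above can be illustrated with a deliberately small numeric toy. This is a sketch under stated assumptions, not the Meta-TTL implementation: here the "adaptation policy" is reduced to a single hypothetical parameter `step_size`, the "actor policy" is a single number, and the environment, `run_episode`, `inner_loop`, `fitness`, and `outer_loop` are all names invented for this example. The abstract does not specify how the real adaptation policy is represented or how the evolutionary search is configured.

```python
import random


def run_episode(actor, target):
    """Toy environment (assumption): one episode returns (reward, feedback).

    Reward is higher the closer the actor's guess is to the hidden target;
    feedback is the signed error the agent observes after the episode.
    """
    error = target - actor
    return -abs(error), error


def inner_loop(step_size, target, episodes=5):
    """Inner loop: a toy TTL process on one task.

    The agent acts, receives feedback, and the adaptation policy
    (here just `step_size`) updates the actor between episodes.
    Returns the reward of the final episode.
    """
    actor = 0.0
    reward, feedback = run_episode(actor, target)
    for _ in range(episodes - 1):
        actor += step_size * feedback  # adaptation step between episodes
        reward, feedback = run_episode(actor, target)
    return reward


def fitness(step_size, tasks):
    """Average final-episode reward of one adaptation policy over a task set."""
    return sum(inner_loop(step_size, t) for t in tasks) / len(tasks)


def outer_loop(tasks, init=0.1, generations=30, population=8, sigma=0.2, seed=0):
    """Outer loop: simple elitist evolutionary search over `step_size`.

    Each generation mutates the current best candidate with Gaussian noise
    and keeps a mutant only if it scores strictly better on the training tasks.
    """
    rng = random.Random(seed)
    best, best_fit = init, fitness(init, tasks)
    for _ in range(generations):
        candidates = [best + rng.gauss(0.0, sigma) for _ in range(population)]
        for candidate in candidates:
            candidate_fit = fitness(candidate, tasks)
            if candidate_fit > best_fit:
                best, best_fit = candidate, candidate_fit
    return best, best_fit


# Hypothetical task sets: small targets for training, larger ones held out.
train_tasks = [2.0, -3.0, 5.0]
ood_tasks = [40.0, -75.0]

learned, learned_fit = outer_loop(train_tasks)
print("hand-crafted step size 0.1, train fitness:", fitness(0.1, train_tasks))
print("learned step size:", learned, "train fitness:", learned_fit)
print("held-out fitness (hand-crafted vs learned):",
      fitness(0.1, ood_tasks), fitness(learned, ood_tasks))
```

In this toy, the fixed value `0.1` plays the role of a hand-crafted adaptation policy, and the search is free to find a better one using only inner-loop performance as its signal. The held-out comparison is trivially favorable here because the toy tasks differ only in scale, so it says nothing about the generalization results reported for Jericho or WebArena-Lite.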
