Provably Learning from Language Feedback

Abstract

Interactively learning from observation and language feedback is an increasingly studied area, driven by the emergence of large language model (LLM) agents. While impressive empirical demonstrations have been shown, a principled framing of these decision problems is still lacking. In this paper, we formalize the Learning from Language Feedback (LLF) problem, state assumptions sufficient to enable learning despite latent rewards, and introduce the transfer eluder dimension as a complexity measure that characterizes the hardness of LLF problems. We show that the transfer eluder dimension captures the intuition that information in the feedback changes the learning complexity of an LLF problem. We demonstrate cases where learning from rich language feedback can be exponentially faster than learning from reward alone. We develop a no-regret algorithm, called HELiX, that provably solves LLF problems through sequential interactions, with performance guarantees that scale with the transfer eluder dimension of the problem. Across several empirical domains, we show that HELiX performs well even when repeatedly prompting LLMs does not work reliably. Our contributions mark a first step towards designing principled interactive learning algorithms from generic language feedback.
