Provably Learning from Language Feedback

Abstract

Interactively learning from observation and language feedback is an increasingly studied area, driven by the emergence of large language model (LLM) agents. Although impressive empirical demonstrations have been shown, a principled framing of these decision problems is still lacking. In this paper, we formalize the Learning from Language Feedback (LLF) problem, state assumptions sufficient to enable learning despite latent rewards, and introduce the transfer eluder dimension as a complexity measure that characterizes the hardness of LLF problems. We show that the transfer eluder dimension captures the intuition that information in the feedback changes the learning complexity of the LLF problem. We demonstrate cases where learning from rich language feedback can be exponentially faster than learning from reward. We develop a no-regret algorithm, called HELiX, that provably solves LLF problems through sequential interactions, with performance guarantees that scale with the transfer eluder dimension of the problem. Across several empirical domains, we show that HELiX performs well even when repeatedly prompting LLMs does not work reliably. Our contributions mark a first step towards designing principled interactive learning algorithms from generic language feedback.
