Provably Learning from Language Feedback

June 12, 2025
Authors: Wanqiao Xu, Allen Nie, Ruijie Zheng, Aditya Modi, Adith Swaminathan, Ching-An Cheng
cs.AI

Abstract

Interactively learning from observation and language feedback is an increasingly studied area, driven by the emergence of large language model (LLM) agents. While impressive empirical demonstrations have been shown, a principled framing of these decision problems has so far been lacking. In this paper, we formalize the Learning from Language Feedback (LLF) problem, state assumptions sufficient to enable learning despite latent rewards, and introduce the transfer eluder dimension as a complexity measure characterizing the hardness of LLF problems. We show that the transfer eluder dimension captures the intuition that information in the feedback changes the learning complexity of the LLF problem. We demonstrate cases where learning from rich language feedback can be exponentially faster than learning from reward. We develop a no-regret algorithm, called HELiX, that provably solves LLF problems through sequential interactions, with performance guarantees that scale with the transfer eluder dimension of the problem. Across several empirical domains, we show that HELiX performs well even when repeatedly prompting LLMs does not work reliably. Our contributions mark a first step towards designing principled interactive learning algorithms from generic language feedback.
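
To make the interaction protocol concrete: in an LLF problem the learner acts, observes an outcome, and receives free-form language feedback, while the reward itself stays latent. Below is a minimal Python sketch of that loop under illustrative assumptions; the environment, the feedback strings, and names such as `LLFEnvironment` and `run_episode` are hypothetical and do not reflect the paper's formalism or the HELiX algorithm itself.

```python
# Illustrative sketch of the Learning from Language Feedback (LLF)
# interaction protocol. All names and behaviors here are assumptions
# made for exposition, not the paper's actual API or algorithm.

from dataclasses import dataclass


@dataclass
class Interaction:
    action: str
    observation: str
    feedback: str  # free-form language feedback; the numeric reward stays latent


class LLFEnvironment:
    """Toy environment: the reward is never revealed, only verbal feedback."""

    def __init__(self, target: str):
        self._target = target  # latent: determines reward, hidden from the agent

    def step(self, action: str) -> Interaction:
        if action == self._target:
            return Interaction(action, "done", "Correct, that solved the task.")
        return Interaction(
            action, "no change", f"'{action}' is wrong; try something closer to the goal."
        )


def run_episode(
    env: LLFEnvironment, candidate_actions: list[str], horizon: int
) -> list[Interaction]:
    """Sequential interaction: act, read the language feedback, refine.

    A HELiX-style learner would use the feedback history to narrow down
    hypotheses about the latent reward; this placeholder policy just
    cycles through candidates to show the information flow.
    """
    history: list[Interaction] = []
    for t in range(horizon):
        action = candidate_actions[t % len(candidate_actions)]  # placeholder policy
        history.append(env.step(action))
        if "Correct" in history[-1].feedback:
            break
    return history


if __name__ == "__main__":
    env = LLFEnvironment(target="open the door")
    for step in run_episode(env, ["push the wall", "open the door"], horizon=5):
        print(step)
```

The sketch only illustrates the information structure: the agent's history contains observations and language feedback but never a numeric reward, which is the setting whose hardness the transfer eluder dimension is introduced to measure.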