

Provably Learning from Language Feedback

June 12, 2025
Authors: Wanqiao Xu, Allen Nie, Ruijie Zheng, Aditya Modi, Adith Swaminathan, Ching-An Cheng
cs.AI

Abstract

Interactively learning from observation and language feedback is an increasingly studied area driven by the emergence of large language model (LLM) agents. Despite impressive empirical demonstrations, a principled framing of these decision problems has so far been lacking. In this paper, we formalize the Learning from Language Feedback (LLF) problem, identify sufficient assumptions that enable learning despite latent rewards, and introduce the transfer eluder dimension as a complexity measure characterizing the hardness of LLF problems. We show that the transfer eluder dimension captures the intuition that information in the feedback changes the learning complexity of the LLF problem. We demonstrate cases where learning from rich language feedback can be exponentially faster than learning from reward. We develop a no-regret algorithm, called HELiX, that provably solves LLF problems through sequential interactions, with performance guarantees that scale with the transfer eluder dimension of the problem. Across several empirical domains, we show that HELiX performs well even when repeatedly prompting LLMs does not work reliably. Our contributions mark a first step towards designing principled interactive learning algorithms from generic language feedback.
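To make the interaction protocol concrete, below is a minimal Python sketch of the sequential LLF loop the abstract describes: at each round the learner proposes an action, the environment returns language feedback rather than a numeric reward, and the latent reward is never revealed to the learner. All names here (`LLFEnvironment`, `propose_action`, etc.) are illustrative placeholders, not the paper's actual formalization or the HELiX algorithm.

```python
# Hypothetical sketch of the LLF interaction protocol described in the abstract.
# All class and function names are illustrative; they are not taken from the paper.
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class LLFEnvironment:
    """Environment with a latent reward; the learner only ever sees language feedback."""
    latent_reward: Callable[[str], float]   # hidden from the learner, used only for evaluation
    feedback_fn: Callable[[str], str]       # maps an action to a natural-language comment

    def step(self, action: str) -> str:
        # Only the language feedback is revealed; the reward stays latent.
        return self.feedback_fn(action)


def run_llf_loop(
    env: LLFEnvironment,
    propose_action: Callable[[List[Tuple[str, str]]], str],
    horizon: int,
) -> List[Tuple[str, str]]:
    """Sequential interaction: propose an action, observe language feedback, repeat."""
    history: List[Tuple[str, str]] = []
    for _ in range(horizon):
        action = propose_action(history)    # e.g., an LLM-based policy conditioned on past feedback
        feedback = env.step(action)         # language feedback only; no numeric reward is observed
        history.append((action, feedback))
    return history
```

In this reading, a no-regret algorithm such as HELiX would supply `propose_action` so that the cumulative latent reward of its choices approaches that of the best fixed action, with guarantees scaling in the transfer eluder dimension; the sketch above only fixes the information flow, not how the feedback is used.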