Scaling Autonomous Agents via Automatic Reward Modeling And Planning

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across a range of text-generation tasks. However, they still struggle with problems that require multi-step decision-making and environmental feedback, such as online shopping, scientific reasoning, and mathematical problem-solving. Unlike pure text data, large-scale decision-making data is difficult to collect. Moreover, many powerful LLMs are accessible only through APIs, which makes fine-tuning them for agent tasks costly and complex. To address these limitations of LLM agents, we propose a framework that automatically learns a reward model from the environment without human annotations. The reward model can be used to evaluate the action trajectories of LLM agents and to provide heuristics for task planning. Specifically, one LLM-based agent navigates the environment randomly, generating diverse action trajectories. A separate LLM then assigns a task intent to each trajectory and synthesizes a negative response alongside the correct one. The resulting triplets (task intent, positive response, negative response) serve as training data for a reward model that scores action trajectories. We demonstrate the effectiveness and generalizability of the framework through evaluations on different agent benchmarks. By automating the learning of reward models, the framework addresses the challenges of data scarcity and API-only access, and it points toward more capable AI agents for real-world problems that require multi-step decision-making.
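The triplet-based training and trajectory scoring described above can be illustrated with a minimal Python sketch. It rests on assumptions that the abstract does not state: the training objective is taken to be a pairwise ranking loss, -log(sigmoid(r_pos - r_neg)), and planning is taken to be a simple best-of-n rerank over candidate trajectories. The names Triplet, pairwise_loss, rerank, and toy_score are hypothetical, and toy_score is a word-overlap stand-in for a learned reward model, not the authors' model.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Triplet:
    """One training example: a task intent with a preferred and a rejected trajectory."""
    intent: str
    positive: str
    negative: str


# A scorer maps (intent, trajectory) to a real-valued reward.
Scorer = Callable[[str, str], float]


def pairwise_loss(score: Scorer, batch: Sequence[Triplet]) -> float:
    """Mean of -log(sigmoid(r_pos - r_neg)) over the batch (assumed objective)."""
    total = 0.0
    for t in batch:
        margin = score(t.intent, t.positive) - score(t.intent, t.negative)
        # -log(sigmoid(m)) equals log(1 + exp(-m)); this form avoids overflow.
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(batch)


def rerank(score: Scorer, intent: str, candidates: Sequence[str]) -> str:
    """Best-of-n heuristic: return the candidate trajectory with the highest reward."""
    return max(candidates, key=lambda c: score(intent, c))


def toy_score(intent: str, trajectory: str) -> float:
    """Hypothetical stand-in reward: number of trajectory words that occur in the intent."""
    words = set(intent.lower().split())
    return float(sum(1 for w in trajectory.lower().split() if w in words))
```

In a real system the scorer would be a trained model and the loss would be minimized by gradient descent over many triplets; here the loss is only evaluated, which is enough to show that a scorer ranking the positive trajectory above the negative one yields a loss below log 2, the value at zero margin.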
