Can large language models explore in-context?

March 22, 2024
Authors: Akshay Krishnamurthy, Keegan Harris, Dylan J. Foster, Cyril Zhang, Aleksandrs Slivkins
cs.AI

Abstract

We investigate the extent to which contemporary Large Language Models (LLMs) can engage in exploration, a core capability in reinforcement learning and decision making. We focus on native performance of existing LLMs, without training interventions. We deploy LLMs as agents in simple multi-armed bandit environments, specifying the environment description and interaction history entirely in-context, i.e., within the LLM prompt. We experiment with GPT-3.5, GPT-4, and Llama2, using a variety of prompt designs, and find that the models do not robustly engage in exploration without substantial interventions: i) Across all of our experiments, only one configuration resulted in satisfactory exploratory behavior: GPT-4 with chain-of-thought reasoning and an externally summarized interaction history, presented as sufficient statistics; ii) All other configurations did not result in robust exploratory behavior, including those with chain-of-thought reasoning but unsummarized history. Although these findings can be interpreted positively, they suggest that external summarization -- which may not be possible in more complex settings -- is important for obtaining desirable behavior from LLM agents. We conclude that non-trivial algorithmic interventions, such as fine-tuning or dataset curation, may be required to empower LLM-based decision making agents in complex settings.
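To make the experimental setup concrete, here is a minimal, hypothetical sketch in Python of the in-context bandit protocol the abstract describes: the interaction history is externally summarized into sufficient statistics (per-arm pull counts and empirical mean rewards) and inserted into the prompt, as in the one successful GPT-4 configuration. The environment parameters, prompt wording, and the `query_llm` placeholder are illustrative assumptions, not details taken from the paper.

```python
import random

# Hypothetical sketch: a Bernoulli multi-armed bandit whose interaction
# history is externally summarized into sufficient statistics before being
# placed in the LLM prompt.

ARM_MEANS = [0.3, 0.5, 0.7]      # hidden Bernoulli success probabilities (assumed)
counts = [0] * len(ARM_MEANS)    # number of times each arm has been pulled
totals = [0.0] * len(ARM_MEANS)  # cumulative reward observed per arm

def summarize_history() -> str:
    """Render the sufficient statistics (counts and mean rewards) as text."""
    lines = []
    for arm in range(len(ARM_MEANS)):
        mean = totals[arm] / counts[arm] if counts[arm] else float("nan")
        lines.append(f"Arm {arm}: pulled {counts[arm]} times, mean reward {mean:.2f}")
    return "\n".join(lines)

def build_prompt() -> str:
    """Assemble the in-context prompt from the summarized history."""
    return (
        "You are choosing among slot machines to maximize total reward.\n"
        "Summary of your interaction history so far:\n"
        f"{summarize_history()}\n"
        "Which arm do you pull next? Reply with the arm index only."
    )

def query_llm(prompt: str) -> int:
    """Placeholder for an actual LLM call (e.g., GPT-4 via an API).
    A random choice is substituted here so the sketch runs standalone."""
    return random.randrange(len(ARM_MEANS))

for t in range(100):
    arm = query_llm(build_prompt())
    reward = 1.0 if random.random() < ARM_MEANS[arm] else 0.0
    counts[arm] += 1
    totals[arm] += reward
```

In the unsuccessful configurations reported in the abstract, the raw turn-by-turn interaction history would appear in the prompt in place of this summary; only the summarized variant, combined with chain-of-thought reasoning on GPT-4, yielded satisfactory exploration.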
