
Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents

February 18, 2026
Authors: Wenxuan Ding, Nicholas Tomlin, Greg Durrett
cs.AI

Abstract

LLMs are increasingly being used for complex problems that are not necessarily resolved in a single response but instead require interacting with an environment to acquire information. In these scenarios, LLMs must reason about inherent cost-uncertainty tradeoffs when deciding whether to stop exploring and commit to an answer. For instance, on a programming task, an LLM should test a generated code snippet if it is uncertain about that code's correctness; the cost of writing a test is nonzero, but typically lower than the cost of making a mistake. In this work, we show that we can induce LLMs to explicitly reason about balancing these cost-uncertainty tradeoffs and thereby perform more optimal environment exploration. We formalize multiple tasks, including information retrieval and coding, as sequential decision-making problems under uncertainty. Each problem has a latent environment state that can be reasoned about via a prior, which is passed to the LLM agent. We introduce a framework called Calibrate-Then-Act (CTA), in which we feed the LLM this additional context to enable it to act more optimally. This improvement is preserved even under RL training of both the baseline and CTA. Our results on information-seeking QA and on a simplified coding task show that making cost-benefit tradeoffs explicit with CTA helps agents discover more optimal decision-making strategies.
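
The tradeoff the abstract describes can be made concrete as a one-step expected-cost comparison. The sketch below is illustrative only and is not taken from the paper: it assumes a single test fully resolves the agent's uncertainty, and all names (`p_correct`, `cost_test`, `cost_error`) are hypothetical.

```python
# Illustrative sketch of the cost-uncertainty tradeoff (not the paper's CTA
# implementation). Simplifying assumption: one test fully reveals whether the
# current answer is correct, so testing avoids the error cost entirely.

def should_test(p_correct: float, cost_test: float, cost_error: float) -> bool:
    """Decide whether to keep exploring (run a test) or commit to an answer.

    Committing now incurs expected cost (1 - p_correct) * cost_error.
    Testing incurs cost_test and, under our one-step assumption,
    resolves the uncertainty.
    """
    expected_cost_commit = (1.0 - p_correct) * cost_error
    return cost_test < expected_cost_commit

# Example: the agent is 70% confident its code is correct; writing a test
# costs 1 unit, while shipping a mistake costs 10 units.
# Expected cost of committing = 0.3 * 10 = 3 > 1, so the agent should test.
print(should_test(p_correct=0.7, cost_test=1.0, cost_error=10.0))  # True
```

Under this simplified rule, the agent tests whenever its calibrated uncertainty makes the expected error cost exceed the test cost, which is the kind of explicit cost-benefit reasoning CTA aims to elicit.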