Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents

Abstract

LLMs are increasingly being used for complex problems that are not necessarily resolved in a single response, but require interacting with an environment to acquire information. In these scenarios, LLMs must reason about the inherent cost-uncertainty tradeoffs in deciding when to stop exploring and commit to an answer. For instance, on a programming task, an LLM should test a generated code snippet if it is uncertain about the correctness of that code; the cost of writing a test is nonzero, but typically lower than the cost of making a mistake. In this work, we show that we can induce LLMs to explicitly reason about balancing these cost-uncertainty tradeoffs and thereby perform more optimal environment exploration. We formalize multiple tasks, including information retrieval and coding, as sequential decision-making problems under uncertainty. Each problem has a latent environment state that can be reasoned about via a prior, which is passed to the LLM agent. We introduce a framework called Calibrate-Then-Act (CTA), in which we feed the LLM this additional context to enable it to act more optimally. This improvement is preserved even under RL training of both the baseline and CTA. Our results on information-seeking QA and on a simplified coding task show that making cost-benefit tradeoffs explicit with CTA can help agents discover more optimal decision-making strategies.
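The testing example in the abstract can be read as a simple expected-cost comparison. The sketch below is only an illustration of that tradeoff, not the CTA framework itself: the function name `should_test`, its parameters, and the assumption that a test perfectly reveals whether the code is correct are all hypothetical simplifications introduced here.

```python
def should_test(p_correct: float, test_cost: float, mistake_cost: float) -> bool:
    """Return True when testing has lower expected cost than committing now.

    Hypothetical illustration only; the names and the cost model are assumptions.
    """
    # Committing immediately incurs the mistake cost with probability (1 - p_correct).
    expected_cost_commit = (1.0 - p_correct) * mistake_cost
    # Simplifying assumption: the test is perfect, so paying test_cost
    # removes the risk of shipping a mistake entirely.
    expected_cost_test = test_cost
    return expected_cost_test < expected_cost_commit


# An uncertain agent (60% prior of being correct) should test first.
print(should_test(p_correct=0.6, test_cost=1.0, mistake_cost=10.0))
# A confident agent (95% prior) can commit without testing.
print(should_test(p_correct=0.95, test_cost=1.0, mistake_cost=10.0))
```

Under these assumptions, testing pays off exactly when the prior probability of an error times the mistake cost exceeds the test cost, which is why a well-calibrated prior matters for deciding when to stop exploring.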
