How Can Input Reformulation Improve Tool Usage Accuracy in a Complex Dynamic Environment? A Study on τ-bench

Abstract

Recent advances in the reasoning and planning capabilities of large language models (LLMs) have opened up their use as autonomous agents capable of tool use in dynamic environments. However, in multi-turn conversational environments such as τ-bench, these agents often struggle with consistent reasoning, adherence to domain-specific policies, and extracting correct information over a long horizon of tool calls and conversation. To capture and mitigate these failures, we conduct a comprehensive manual analysis of the common errors occurring in the conversation trajectories. We then experiment with reformulations of the inputs to the tool-calling agent to improve its decision making. Finally, we propose the Input-Reformulation Multi-Agent (IRMA) framework, which automatically reformulates user queries, augmenting them with relevant domain rules and tool suggestions for the tool-calling agent to focus on. The results show that IRMA significantly outperforms ReAct, Function Calling, and Self-Reflection by 16.1%, 12.7%, and 19.1%, respectively, in overall pass^5 scores. These findings highlight the superior reliability and consistency of IRMA compared to other methods in dynamic environments.
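
The abstract reports results as pass^5 scores without restating the metric. As a minimal sketch, assuming the usual τ-bench-style definition (the chance that all k trials of a task succeed, estimated per task as C(c, k) / C(n, k) from c successes in n trials and then averaged over tasks), the score could be computed as below. The function name `pass_hat_k` and the input layout are hypothetical and are not taken from the paper.

```python
from math import comb


def pass_hat_k(successes_per_task, num_trials, k):
    """Estimate pass^k under the assumed definition.

    successes_per_task: number of successful trials for each task (hypothetical layout).
    num_trials: trials run per task (n).
    k: number of trials that must all succeed.
    """
    if not successes_per_task:
        raise ValueError("need at least one task")
    if not 1 <= k <= num_trials:
        raise ValueError("k must satisfy 1 <= k <= num_trials")
    total = comb(num_trials, k)
    # comb(c, k) is 0 when c < k, so a task with fewer than k successes contributes 0.
    per_task = [comb(c, k) / total for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Three tasks, five trials each, with 5, 4, and 0 successful trials.
scores = [5, 4, 0]
print(pass_hat_k(scores, num_trials=5, k=1))
print(pass_hat_k(scores, num_trials=5, k=5))
```

Under this assumed definition, pass^1 is the ordinary average success rate, while pass^5 only credits tasks solved in every one of five trials, which is why it reads as a consistency measure rather than a raw accuracy.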
