ToolTalk: Evaluating Tool-Usage in a Conversational Setting

Abstract

Large language models (LLMs) have displayed massive improvements in reasoning and decision-making skills and can hold natural conversations with users. Many recent works seek to augment LLM-based assistants with external tools so they can access private or up-to-date information and carry out actions on behalf of users. To better measure the performance of these assistants, this paper introduces ToolTalk, a benchmark consisting of complex user intents requiring multi-step tool usage specified through dialogue. ToolTalk contains 28 tools grouped into 7 plugins, and includes a complete simulated implementation of each tool, allowing for fully automated evaluation of assistants that rely on execution feedback. ToolTalk also emphasizes tools that externally affect the world rather than only tools for referencing or searching information. We evaluate GPT-3.5 and GPT-4 on ToolTalk, obtaining success rates of 26% and 50%, respectively. Our analysis of the errors reveals three major categories and suggests some future directions for improvement. We release ToolTalk at https://github.com/microsoft/ToolTalk.
