FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol

Abstract

This paper introduces FinMCP-Bench, a novel benchmark for evaluating how well large language models (LLMs) solve real-world financial problems by invoking tools through financial Model Context Protocols (MCPs). FinMCP-Bench contains 613 samples spanning 10 main scenarios and 33 sub-scenarios, and it includes both real and synthetic user queries to ensure diversity and authenticity. It incorporates 65 real financial MCPs and three types of samples (single-tool, multi-tool, and multi-turn), allowing models to be evaluated across different levels of task complexity. Using this benchmark, we systematically assess a range of mainstream LLMs and propose metrics that explicitly measure tool invocation accuracy and reasoning capabilities. FinMCP-Bench provides a standardized, practical, and challenging testbed for advancing research on financial LLM agents.
