FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol

Abstract

This paper introduces FinMCP-Bench, a novel benchmark for evaluating how well large language models (LLMs) solve real-world financial problems by invoking financial tools via the Model Context Protocol (MCP). FinMCP-Bench contains 613 samples spanning 10 main scenarios and 33 sub-scenarios, and it combines real and synthetic user queries to ensure diversity and authenticity. It incorporates 65 real financial MCPs and three types of samples (single-tool, multi-tool, and multi-turn), allowing models to be evaluated across different levels of task complexity. Using this benchmark, we systematically assess a range of mainstream LLMs and propose metrics that explicitly measure tool invocation accuracy and reasoning capabilities. FinMCP-Bench provides a standardized, practical, and challenging testbed for advancing research on financial LLM agents.
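The abstract mentions metrics for tool invocation accuracy without defining them. As an illustration only, the sketch below shows one common way such a score could be computed: set-based precision, recall, and F1 over exact-match tool calls. The `ToolCall` shape, the `make_call` and `tool_call_scores` helpers, and the tool names are all hypothetical; none of them is taken from FinMCP-Bench, and the paper's actual metrics may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One tool invocation. Hypothetical shape, not the benchmark's schema."""
    name: str
    arguments: tuple  # sorted (key, value) pairs, so calls are hashable


def make_call(name, **arguments):
    """Build a ToolCall whose arguments compare equal regardless of order."""
    return ToolCall(name, tuple(sorted(arguments.items())))


def tool_call_scores(predicted, gold):
    """Return (precision, recall, F1) over exact-match tool calls.

    Assumption: a predicted call counts as correct only if both the tool
    name and every argument match a gold call exactly.
    """
    pred_set, gold_set = set(predicted), set(gold)
    matched = len(pred_set & gold_set)
    precision = matched / len(pred_set) if pred_set else 0.0
    recall = matched / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


# Hypothetical multi-tool sample: two gold calls, one of which is predicted.
gold = [
    make_call("get_stock_price", symbol="AAPL"),
    make_call("get_fx_rate", base="USD", quote="JPY"),
]
predicted = [
    make_call("get_stock_price", symbol="AAPL"),
    make_call("get_fund_nav", fund_id="F001"),
]

precision, recall, f1 = tool_call_scores(predicted, gold)
print(precision, recall, f1)
```

In this toy case one of two predicted calls matches one of two gold calls, so all three scores are 0.5. A set-based comparison ignores call order; a multi-turn setting would likely need an order-aware or per-turn variant.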
