FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning

Abstract

Multi-step symbolic reasoning is critical for advancing downstream performance on financial tasks. Yet, benchmarks for systematically evaluating this capability are lacking. Existing datasets like FinQA and ConvFinQA supervise only final numerical answers, without assessing intermediate reasoning steps. To address this, we introduce FinChain, the first symbolic benchmark designed for verifiable Chain-of-Thought (CoT) financial reasoning. Spanning 54 topics across 12 financial domains, FinChain offers five parameterized templates per topic, each varying in reasoning complexity and domain expertise required. Each dataset instance includes an executable Python trace, enabling automatic generation of extensive training data and easy adaptation to other domains. We also introduce ChainEval, a new metric for automatic evaluation of both final answers and intermediate reasoning. Benchmarking 30 LLMs on our dataset, we find that even state-of-the-art models have considerable room for improvement in multi-step financial reasoning. All templates and evaluation metrics for FinChain are available at https://github.com/mbzuai-nlp/finchain.
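
To make the idea of a parameterized template with an executable trace concrete, the following is a minimal sketch and not the actual FinChain template format. The function names (`compound_interest_template`, `check_steps`), the `Step` record, the compound-interest scenario, and the tolerance are all hypothetical choices made for illustration. The step check is a toy numeric comparison and is not ChainEval.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Step:
    """One intermediate result in the reasoning trace (hypothetical structure)."""
    description: str
    value: float


def compound_interest_template(seed: int):
    """Hypothetical template: sample parameters, then compute every step in code."""
    rng = random.Random(seed)  # seeding makes each instance reproducible
    principal = rng.randrange(1_000, 10_001, 500)
    rate_pct = rng.choice([2, 3, 4, 5, 6])
    years = rng.randint(2, 5)

    question = (
        f"An investor deposits ${principal} at {rate_pct}% annual interest, "
        f"compounded yearly. How much interest is earned after {years} years?"
    )

    # Because the trace is executed, each intermediate value is known exactly.
    growth = (1 + rate_pct / 100) ** years
    final_balance = principal * growth
    interest = final_balance - principal

    steps = [
        Step("growth factor", growth),
        Step("final balance", final_balance),
        Step("interest earned", interest),
    ]
    return question, steps, round(interest, 2)


def check_steps(predicted, reference, rel_tol=1e-2):
    """Toy check (assumption, not ChainEval): every predicted step value must
    match the corresponding reference step within a relative tolerance."""
    if len(predicted) != len(reference):
        return False
    return all(
        math.isclose(p, r.value, rel_tol=rel_tol)
        for p, r in zip(predicted, reference)
    )
```

Calling `compound_interest_template` with different seeds yields different question instances from the same template, each paired with reference intermediate values that a model's reasoning steps can be compared against.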
