FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning
June 3, 2025
Authors: Zhuohan Xie, Dhruv Sahnan, Debopriyo Banerjee, Georgi Georgiev, Rushil Thareja, Hachem Madmoun, Jinyan Su, Aaryamonvikram Singh, Yuxia Wang, Rui Xing, Fajri Koto, Haonan Li, Ivan Koychev, Tanmoy Chakraborty, Salem Lahlou, Veselin Stoyanov, Preslav Nakov
cs.AI
Abstract
Multi-step symbolic reasoning is critical for advancing downstream performance on financial tasks. Yet, benchmarks for systematically evaluating this capability are lacking. Existing datasets like FinQA and ConvFinQA supervise only final numerical answers, without assessing intermediate reasoning steps. To address this, we introduce FinChain, the first symbolic benchmark designed for verifiable Chain-of-Thought (CoT) financial reasoning. Spanning 54 topics across 12 financial domains, FinChain offers five parameterized templates per topic, each varying in reasoning complexity and domain expertise required. Each dataset instance includes an executable Python trace, enabling automatic generation of extensive training data and easy adaptation to other domains. We also introduce ChainEval, a new metric for automatic evaluation of both final answers and intermediate reasoning. Benchmarking 30 LLMs on our dataset, we find that even state-of-the-art models have considerable room for improvement in multi-step financial reasoning. All templates and evaluation metrics for FinChain are available at https://github.com/mbzuai-nlp/finchain.
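To illustrate the idea of a parameterized template with an executable trace, here is a minimal, hypothetical sketch (not the actual FinChain template format; the function name, topic, and trace structure are assumptions for illustration). Sampling parameters from a seed yields a question, a step-by-step trace whose intermediate values are executable Python, and a final answer that can be verified automatically:

```python
import random

def simple_interest_template(seed):
    """Hypothetical parameterized template in the spirit of FinChain:
    a seed deterministically samples parameters, producing a question,
    an executable step-by-step trace, and a verifiable final answer."""
    rng = random.Random(seed)
    principal = rng.randrange(1_000, 50_000, 500)   # dollars
    rate = rng.choice([0.02, 0.03, 0.045, 0.05])    # annual simple rate
    years = rng.randint(1, 10)

    question = (
        f"An investor deposits ${principal} at a simple annual interest "
        f"rate of {rate:.1%} for {years} years. How much interest is earned?"
    )

    # Each step pairs a natural-language description with its computed
    # value, so both intermediate reasoning and the final answer can be
    # checked against a model's CoT output.
    annual = principal * rate
    interest = annual * years
    trace = [
        (f"annual interest = {principal} * {rate} = {annual:.2f}", annual),
        (f"total interest = {annual:.2f} * {years} = {interest:.2f}", interest),
    ]
    return question, trace, round(interest, 2)

question, trace, answer = simple_interest_template(seed=42)
```

Because the trace is ordinary Python, regenerating with a new seed yields a fresh instance with a guaranteed-correct gold chain, which is what enables both large-scale training-data generation and stepwise evaluation.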