FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning
June 3, 2025
Authors: Zhuohan Xie, Dhruv Sahnan, Debopriyo Banerjee, Georgi Georgiev, Rushil Thareja, Hachem Madmoun, Jinyan Su, Aaryamonvikram Singh, Yuxia Wang, Rui Xing, Fajri Koto, Haonan Li, Ivan Koychev, Tanmoy Chakraborty, Salem Lahlou, Veselin Stoyanov, Preslav Nakov
cs.AI
Abstract
Multi-step symbolic reasoning is critical for advancing downstream performance on financial tasks. Yet, benchmarks for systematically evaluating this capability are lacking. Existing datasets like FinQA and ConvFinQA supervise only final numerical answers, without assessing intermediate reasoning steps. To address this, we introduce FinChain, the first symbolic benchmark designed for verifiable Chain-of-Thought (CoT) financial reasoning. Spanning 54 topics across 12 financial domains, FinChain offers five parameterized templates per topic, each varying in reasoning complexity and domain expertise required. Each dataset instance includes an executable Python trace, enabling automatic generation of extensive training data and easy adaptation to other domains. We also introduce ChainEval, a new metric for automatic evaluation of both final answers and intermediate reasoning. Benchmarking 30 LLMs on our dataset, we find that even state-of-the-art models have considerable room for improvement in multi-step financial reasoning. All templates and evaluation metrics for FinChain are available at https://github.com/mbzuai-nlp/finchain.
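To illustrate the idea of a parameterized template with an executable trace, here is a minimal, hypothetical sketch (not the actual FinChain template format; the function name, topic, and trace structure are assumptions for illustration). Sampling parameters from a seed yields a question, a step-by-step trace whose intermediate values are executable Python, and a final answer that can be verified automatically:

```python
import random

def simple_interest_template(seed):
    """Hypothetical parameterized template in the spirit of FinChain:
    a seed deterministically samples parameters, producing a question,
    an executable step-by-step trace, and a verifiable final answer."""
    rng = random.Random(seed)
    principal = rng.randrange(1_000, 50_000, 500)   # dollars
    rate = rng.choice([0.02, 0.03, 0.045, 0.05])    # annual simple rate
    years = rng.randint(1, 10)

    question = (
        f"An investor deposits ${principal} at a simple annual interest "
        f"rate of {rate:.1%} for {years} years. How much interest is earned?"
    )

    # Each step pairs a natural-language description with its computed
    # value, so both intermediate reasoning and the final answer can be
    # checked against a model's CoT output.
    annual = principal * rate
    interest = annual * years
    trace = [
        (f"annual interest = {principal} * {rate} = {annual:.2f}", annual),
        (f"total interest = {annual:.2f} * {years} = {interest:.2f}", interest),
    ]
    return question, trace, round(interest, 2)

question, trace, answer = simple_interest_template(seed=42)
```

Because the trace is ordinary Python, regenerating with a new seed yields a fresh instance with a guaranteed-correct gold chain, which is what enables both large-scale training-data generation and stepwise evaluation.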