

FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning

June 3, 2025
作者: Zhuohan Xie, Dhruv Sahnan, Debopriyo Banerjee, Georgi Georgiev, Rushil Thareja, Hachem Madmoun, Jinyan Su, Aaryamonvikram Singh, Yuxia Wang, Rui Xing, Fajri Koto, Haonan Li, Ivan Koychev, Tanmoy Chakraborty, Salem Lahlou, Veselin Stoyanov, Preslav Nakov
cs.AI

Abstract

Multi-step symbolic reasoning is critical for advancing downstream performance on financial tasks. Yet, benchmarks for systematically evaluating this capability are lacking. Existing datasets like FinQA and ConvFinQA supervise only final numerical answers, without assessing intermediate reasoning steps. To address this, we introduce FinChain, the first symbolic benchmark designed for verifiable Chain-of-Thought (CoT) financial reasoning. Spanning 54 topics across 12 financial domains, FinChain offers five parameterized templates per topic, each varying in reasoning complexity and domain expertise required. Each dataset instance includes an executable Python trace, enabling automatic generation of extensive training data and easy adaptation to other domains. We also introduce ChainEval, a new metric for automatic evaluation of both final answers and intermediate reasoning. Benchmarking 30 LLMs on our dataset, we find that even state-of-the-art models have considerable room for improvement in multi-step financial reasoning. All templates and evaluation metrics for FinChain are available at https://github.com/mbzuai-nlp/finchain.
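The abstract describes parameterized templates backed by executable Python traces, plus a ChainEval metric that scores both the final answer and the intermediate reasoning. The sketch below illustrates what such a template and a step-matching check might look like; the function names, parameter ranges, and tolerance are illustrative assumptions, not FinChain's actual interface.

```python
# Hypothetical sketch of a FinChain-style parameterized template with an
# executable reasoning trace; names, ranges, and the ChainEval-like step
# check are illustrative assumptions, not the benchmark's actual API.
import random

def compound_interest_template(seed: int):
    """Sample symbolic parameters and emit (question, gold steps, final answer)."""
    rng = random.Random(seed)
    principal = rng.randrange(1_000, 50_000, 500)   # initial deposit in USD
    rate = rng.choice([0.03, 0.04, 0.05, 0.06])     # annual interest rate
    years = rng.randint(2, 10)                      # holding period in years

    # Executable trace: each intermediate quantity is computed explicitly,
    # so every reasoning step can be verified against the symbolic formula.
    growth_factor = (1 + rate) ** years
    final_value = principal * growth_factor

    question = (
        f"An investor deposits ${principal} at an annual compound rate of "
        f"{rate:.0%} for {years} years. What is the final value?"
    )
    gold_steps = [
        ("growth_factor", growth_factor),
        ("final_value", final_value),
    ]
    return question, gold_steps, round(final_value, 2)

def match_steps(predicted: list, gold_steps, tol: float = 1e-2) -> float:
    """Toy ChainEval-style score: fraction of gold intermediate values that
    appear (within a relative tolerance) among the model's predicted numbers."""
    matched = sum(
        any(abs(p - gold) <= tol * max(1.0, abs(gold)) for p in predicted)
        for _, gold in gold_steps
    )
    return matched / len(gold_steps)

if __name__ == "__main__":
    question, gold, answer = compound_interest_template(seed=42)
    print(question)
    # Numbers extracted from a model's chain-of-thought would be scored like this:
    print(match_steps([1.28, answer], gold), answer)
```

Because the trace is executable, resampling the seed yields unlimited verified training instances, and the same scoring idea extends to any domain whose templates expose named intermediate values.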