FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering
October 7, 2025
Authors: Yitao Long, Tiansheng Hu, Yilun Zhao, Arman Cohan, Chen Zhao
cs.AI
Abstract
Large Language Models (LLMs) frequently hallucinate when answering long-form
questions, producing plausible yet factually incorrect answers. A common mitigation
strategy is to provide attribution to LLM outputs. However, existing benchmarks
primarily focus on simple attribution that retrieves supporting textual
evidence as references. We argue that in real-world scenarios such as financial
applications, attribution goes beyond reference retrieval. We introduce
FinLFQA, a benchmark designed to evaluate the ability of LLMs to generate
long-form answers to complex financial questions with reliable and nuanced
attributions. FinLFQA evaluates three critical aspects of attribution through
human annotations: (1) supporting evidence extracted from financial reports,
(2) intermediate numerical reasoning steps, and (3) domain-specific financial
knowledge that informs the reasoning process. We further provide an automatic
evaluation framework covering both answer quality and attribution quality.
Through extensive experiments on eight LLMs across multiple
attribution-generation paradigms, we find that fine-grained metrics are
important to distinguish model capabilities, that end-to-end generation
achieves comparable performance to post-hoc approaches, and that iterative
refinement only helps when guided by external feedback.