FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering
October 7, 2025
Authors: Yitao Long, Tiansheng Hu, Yilun Zhao, Arman Cohan, Chen Zhao
cs.AI
Abstract
Large Language Models (LLMs) frequently hallucinate when answering long-form
questions, producing plausible yet factually incorrect answers. A common mitigation
strategy is to provide attribution to LLM outputs. However, existing benchmarks
primarily focus on simple attribution that retrieves supporting textual
evidence as references. We argue that in real-world scenarios such as financial
applications, attribution goes beyond reference retrieval. We introduce
FinLFQA, a benchmark designed to evaluate the ability of LLMs to generate
long-form answers to complex financial questions with reliable and nuanced
attributions. FinLFQA evaluates three critical aspects of attribution through
human annotations: (1) supporting evidence extracted from financial reports,
(2) intermediate numerical reasoning steps, and (3) domain-specific financial
knowledge that informs the reasoning process. We further provide an automatic
evaluation framework covering both answer quality and attribution quality.
Through extensive experiments on eight LLMs across multiple
attribution-generation paradigms, we find that fine-grained metrics are
important to distinguish model capabilities, that end-to-end generation
achieves comparable performance to post-hoc approaches, and that iterative
refinement only helps when guided by external feedback.