
FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering

October 7, 2025
Authors: Yitao Long, Tiansheng Hu, Yilun Zhao, Arman Cohan, Chen Zhao
cs.AI

Abstract

Large Language Models (LLMs) frequently hallucinate when answering long-form questions, producing plausible yet factually incorrect answers. A common mitigation strategy is to provide attribution for LLM outputs. However, existing benchmarks primarily focus on simple attribution that retrieves supporting textual evidence as references. We argue that in real-world scenarios such as financial applications, attribution goes beyond reference retrieval. We introduce FinLFQA, a benchmark designed to evaluate the ability of LLMs to generate long-form answers to complex financial questions with reliable and nuanced attributions. FinLFQA evaluates three critical aspects of attribution through human annotations: (1) supporting evidence extracted from financial reports, (2) intermediate numerical reasoning steps, and (3) domain-specific financial knowledge that informs the reasoning process. We further provide an automatic evaluation framework covering both answer quality and attribution quality. Through extensive experiments on eight LLMs across multiple attribution-generation paradigms, we find that fine-grained metrics are important for distinguishing model capabilities, that end-to-end generation achieves performance comparable to post-hoc approaches, and that iterative refinement helps only when guided by external feedback.
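The abstract describes each annotated example as pairing a long-form answer with three kinds of attribution. The following is a minimal Python sketch of what such a record might look like; the class name, field names, and the filled-in financial figures are illustrative assumptions based only on the abstract, not the benchmark's actual schema or data.

from dataclasses import dataclass, field

@dataclass
class FinLFQAExample:
    """Hypothetical schema for one FinLFQA-style annotated instance."""
    question: str                                              # complex financial question
    answer: str                                                # long-form reference answer
    evidence: list[str] = field(default_factory=list)          # (1) spans from financial reports
    reasoning_steps: list[str] = field(default_factory=list)   # (2) intermediate numerical reasoning
    domain_knowledge: list[str] = field(default_factory=list)  # (3) financial knowledge used in reasoning

# Fabricated example values, for illustration only.
example = FinLFQAExample(
    question="How did the company's operating margin change from FY2022 to FY2023?",
    answer="Operating margin improved from roughly 12% to roughly 14%, driven by ...",
    evidence=["10-K, Item 7: Operating income was $1.2B on revenue of $8.5B in FY2023 ..."],
    reasoning_steps=["operating_margin_2023 = 1.2 / 8.5 ≈ 14.1%"],
    domain_knowledge=["Operating margin = operating income / total revenue."],
)

Under this reading, answer quality would be scored against the reference answer, while attribution quality would be scored separately against each of the three annotation fields.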