FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering

Abstract

Large Language Models (LLMs) frequently hallucinate when answering long-form questions, producing plausible yet factually incorrect answers. A common mitigation strategy is to attach attributions to LLM outputs. However, existing benchmarks focus primarily on simple attribution, which retrieves supporting textual evidence to serve as references. We argue that in real-world scenarios such as financial applications, attribution goes beyond reference retrieval. We introduce FinLFQA, a benchmark designed to evaluate the ability of LLMs to generate long-form answers to complex financial questions with reliable and nuanced attributions. FinLFQA evaluates three critical aspects of attribution through human annotations: (1) supporting evidence extracted from financial reports, (2) intermediate numerical reasoning steps, and (3) domain-specific financial knowledge that informs the reasoning process. We further provide an automatic evaluation framework covering both answer quality and attribution quality. Through extensive experiments on eight LLMs across multiple attribution-generation paradigms, we find that fine-grained metrics are important for distinguishing model capabilities, that end-to-end generation achieves performance comparable to post-hoc approaches, and that iterative refinement helps only when guided by external feedback.

