UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models

Abstract

This paper introduces UCFE, the User-Centric Financial Expertise benchmark, an innovative framework designed to evaluate the ability of large language models (LLMs) to handle complex real-world financial tasks. The UCFE benchmark adopts a hybrid approach that combines human expert evaluations with dynamic, task-specific interactions to simulate the complexities of evolving financial scenarios. First, we conducted a user study involving 804 participants, collecting their feedback on financial tasks. Second, based on this feedback, we created a dataset that encompasses a wide range of user intents and interactions. This dataset serves as the foundation for benchmarking 12 LLM services using the LLM-as-Judge methodology. Our results show a significant alignment between benchmark scores and human preferences, with a Pearson correlation coefficient of 0.78, confirming the effectiveness of the UCFE dataset and our evaluation approach. The UCFE benchmark not only reveals the potential of LLMs in the financial sector but also provides a robust framework for assessing their performance and user satisfaction. The benchmark dataset and evaluation code are available.