
Data-Centric Financial Large Language Models

October 7, 2023
作者: Zhixuan Chu, Huaiyu Guo, Xinyuan Zhou, Yijia Wang, Fei Yu, Hong Chen, Wanqing Xu, Xin Lu, Qing Cui, Longfei Li, Jun Zhou, Sheng Li
cs.AI

Abstract

Large language models (LLMs) show promise for natural language tasks but struggle when applied directly to complex domains like finance. LLMs have difficulty reasoning about and integrating all relevant information. We propose a data-centric approach to enable LLMs to better handle financial tasks. Our key insight is that rather than overloading the LLM with everything at once, it is more effective to preprocess and pre-understand the data. We create a financial LLM (FLLM) using multitask prompt-based finetuning to achieve data pre-processing and pre-understanding. However, labeled data is scarce for each task. To overcome manual annotation costs, we employ abductive augmentation reasoning (AAR) to automatically generate training data by modifying the pseudo labels from FLLM's own outputs. Experiments show our data-centric FLLM with AAR substantially outperforms baseline financial LLMs designed for raw text, achieving state-of-the-art on financial analysis and interpretation tasks. We also open source a new benchmark for financial analysis and interpretation. Our methodology provides a promising path to unlock LLMs' potential for complex real-world domains.
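The AAR pipeline described above can be illustrated with a minimal, hypothetical sketch: the financial LLM produces pseudo labels for unlabeled text, an abductive step revises labels that conflict with domain knowledge, and the corrected pairs become training data. All function names and the keyword rule below are illustrative stand-ins, not the paper's actual implementation.

```python
# Hypothetical sketch of an AAR-style pseudo-label refinement loop.

def fllm_predict(text):
    # Stand-in for the financial LLM's pseudo-label output.
    return {"text": text, "label": "neutral"}

def abductive_correct(example):
    # Stand-in for the abductive augmentation reasoning step: revise the
    # pseudo label when it is inconsistent with domain knowledge
    # (here, a toy keyword rule).
    if "loss" in example["text"].lower():
        example["label"] = "negative"
    return example

def build_training_data(unlabeled_texts):
    # Generate pseudo labels, then refine them abductively; the refined
    # pairs serve as automatically generated training data.
    return [abductive_correct(fllm_predict(t)) for t in unlabeled_texts]

data = build_training_data(["Quarterly loss widened", "Revenue held steady"])
print(data[0]["label"])  # corrected to "negative" by the abductive step
```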