Data-Centric Financial Large Language Models

Abstract

Large language models (LLMs) show promise for natural language tasks but struggle when applied directly to complex domains like finance, where they have difficulty reasoning about and integrating all relevant information. We propose a data-centric approach that enables LLMs to better handle financial tasks. Our key insight is that, rather than overloading the LLM with everything at once, it is more effective to preprocess and pre-understand the data. We create a financial LLM (FLLM) using multitask prompt-based finetuning to perform this data preprocessing and pre-understanding. However, labeled data is scarce for each task. To avoid the cost of manual annotation, we employ abductive augmentation reasoning (AAR) to automatically generate training data by modifying the pseudo labels from FLLM's own outputs. Experiments show that our data-centric FLLM with AAR substantially outperforms baseline financial LLMs designed for raw text, achieving state-of-the-art performance on financial analysis and interpretation tasks. We also open-source a new benchmark for financial analysis and interpretation. Our methodology provides a promising path to unlocking LLMs' potential for complex real-world domains.
