Expect the Unexpected: FailSafe Long Context QA for Finance
February 10, 2025
Authors: Kiran Kamble, Melisa Russak, Dmytro Mozolevskyi, Muayad Ali, Mateusz Russak, Waseem AlShikh
cs.AI
Abstract
We propose a new long-context financial benchmark, FailSafeQA, designed to
test the robustness and context-awareness of LLMs against six variations in
human-interface interactions in LLM-based query-answer systems within finance.
We concentrate on two case studies: Query Failure and Context Failure. In the
Query Failure scenario, we perturb the original query to vary in domain
expertise, completeness, and linguistic accuracy. In the Context Failure case,
we simulate the uploads of degraded, irrelevant, and empty documents. We employ
the LLM-as-a-Judge methodology with Qwen2.5-72B-Instruct and use fine-grained
rating criteria to define and calculate Robustness, Context Grounding, and
Compliance scores for 24 off-the-shelf models. The results suggest that
although some models excel at mitigating input perturbations, they must balance
robust answering with the ability to refrain from hallucinating. Notably,
Palmyra-Fin-128k-Instruct, recognized as the most compliant model, maintained
strong baseline performance but encountered challenges in sustaining robust
predictions in 17% of test cases. On the other hand, the most robust model,
OpenAI o3-mini, fabricated information in 41% of tested cases. The results
demonstrate that even high-performing models have significant room for
improvement and highlight the role of FailSafeQA as a tool for developing LLMs
optimized for dependability in financial applications. The dataset is available
at: https://huggingface.co/datasets/Writer/FailSafeQA