Expect the Unexpected: FailSafe Long Context QA for Finance
February 10, 2025
Authors: Kiran Kamble, Melisa Russak, Dmytro Mozolevskyi, Muayad Ali, Mateusz Russak, Waseem AlShikh
cs.AI
Abstract
We propose a new long-context financial benchmark, FailSafeQA, designed to
test the robustness and context-awareness of LLMs against six variations in
human-interface interactions in LLM-based query-answer systems within finance.
We concentrate on two case studies: Query Failure and Context Failure. In the
Query Failure scenario, we perturb the original query to vary in domain
expertise, completeness, and linguistic accuracy. In the Context Failure case,
we simulate the uploads of degraded, irrelevant, and empty documents. We employ
the LLM-as-a-Judge methodology with Qwen2.5-72B-Instruct and use fine-grained
rating criteria to define and calculate Robustness, Context Grounding, and
Compliance scores for 24 off-the-shelf models. The results suggest that
although some models excel at mitigating input perturbations, they must balance
robust answering with the ability to refrain from hallucinating. Notably,
Palmyra-Fin-128k-Instruct, recognized as the most compliant model, maintained
strong baseline performance but encountered challenges in sustaining robust
predictions in 17% of test cases. On the other hand, the most robust model,
OpenAI o3-mini, fabricated information in 41% of tested cases. The results
demonstrate that even high-performing models have significant room for
improvement and highlight the role of FailSafeQA as a tool for developing LLMs
optimized for dependability in financial applications. The dataset is available
at: https://huggingface.co/datasets/Writer/FailSafeQA