Ebisu: Benchmarking Large Language Models in Japanese Finance
February 1, 2026
Authors: Xueqing Peng, Ruoyu Xiang, Fan Zhang, Mingzi Song, Mingyang Jiang, Yan Wang, Lingfei Qian, Taiki Hara, Yuqing Guo, Jimin Huang, Junichi Tsujii, Sophia Ananiadou
cs.AI
Abstract
Japanese finance combines agglutinative, head-final linguistic structure, mixed writing systems, and high-context communication norms that rely on indirect expression and implicit commitment, posing substantial challenges for LLMs. We introduce Ebisu, a benchmark for native Japanese financial language understanding comprising two linguistically and culturally grounded, expert-annotated tasks: JF-ICR, which evaluates recognition of implicit commitments and refusals in investor-facing Q&A, and JF-TE, which assesses hierarchical extraction and ranking of nested financial terminology from professional disclosures. We evaluate a diverse set of open-source and proprietary LLMs spanning general-purpose, Japanese-adapted, and finance-specific models. Results show that even state-of-the-art systems struggle on both tasks: increased model scale yields only limited gains, and language- and domain-specific adaptation does not reliably improve performance, leaving substantial gaps unresolved. Ebisu provides a focused benchmark for advancing linguistically and culturally grounded financial NLP. All datasets and evaluation scripts are publicly released.
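As a rough illustration of how a JF-ICR-style task might be scored, the sketch below computes accuracy and per-label F1 for a three-way implicit commitment/refusal classification. The label set and data layout here are assumptions for illustration only; the paper's released datasets and evaluation scripts define the actual protocol.

```python
# Hypothetical sketch of scoring a JF-ICR-style classification task.
# The three-way label set below is an assumption for illustration;
# consult the released Ebisu evaluation scripts for the real schema.
import json

LABELS = {"commitment", "refusal", "neither"}  # assumed label set

def evaluate(predictions: list[str], gold: list[str]) -> dict[str, float]:
    """Accuracy and per-label F1 over parallel prediction/gold lists."""
    assert len(predictions) == len(gold) and gold
    correct = sum(p == g for p, g in zip(predictions, gold))
    scores = {"accuracy": correct / len(gold)}
    for label in LABELS:
        tp = sum(p == g == label for p, g in zip(predictions, gold))
        fp = sum(p == label != g for p, g in zip(predictions, gold))
        fn = sum(g == label != p for p, g in zip(predictions, gold))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores[f"f1_{label}"] = (
            2 * prec * rec / (prec + rec) if prec + rec else 0.0
        )
    return scores

if __name__ == "__main__":
    preds = ["commitment", "refusal", "neither"]
    gold = ["commitment", "neither", "neither"]
    print(json.dumps(evaluate(preds, gold), indent=2))
```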