Ebisu: Benchmarking Large Language Models in Japanese Finance
February 1, 2026
Authors: Xueqing Peng, Ruoyu Xiang, Fan Zhang, Mingzi Song, Mingyang Jiang, Yan Wang, Lingfei Qian, Taiki Hara, Yuqing Guo, Jimin Huang, Junichi Tsujii, Sophia Ananiadou
cs.AI
Abstract
Japanese finance combines an agglutinative, head-final linguistic structure, mixed writing systems, and high-context communication norms that rely on indirect expression and implicit commitment, posing a substantial challenge for LLMs. We introduce Ebisu, a benchmark for native Japanese financial language understanding, comprising two linguistically and culturally grounded, expert-annotated tasks: JF-ICR, which evaluates implicit commitment and refusal recognition in investor-facing Q&A, and JF-TE, which assesses hierarchical extraction and ranking of nested financial terminology from professional disclosures. We evaluate a diverse set of open-source and proprietary LLMs spanning general-purpose, Japanese-adapted, and financial models. Results show that even state-of-the-art systems struggle on both tasks: increased model scale yields only limited improvements, and language- and domain-specific adaptation does not reliably improve performance, leaving substantial gaps unresolved. Ebisu provides a focused benchmark for advancing linguistically and culturally grounded financial NLP. All datasets and evaluation scripts are publicly released.