
Cross-Lingual Stability of LLM Judges Under Controlled Generation: Evidence from Finno-Ugric Languages

February 2, 2026
Authors: Isaac Chung, Linda Freienthal
cs.AI

Abstract

Cross-lingual evaluation of large language models (LLMs) typically conflates two sources of variance: genuine model performance differences and measurement instability. We investigate evaluation reliability by holding generation conditions constant while varying target language. Using synthetic customer-support dialogues generated with identical parameters across Estonian, Finnish, and Hungarian, we test whether automatic metrics and LLM-as-a-judge scoring produce stable model rankings across these morphologically rich, related Finno-Ugric languages. With a small set of Estonian native speaker annotations as a reference point, we find systematic ranking instabilities: surface-level metrics (lexical diversity, surface and semantic similarity) maintain cross-language stability, but pragmatic judgments (coherence, instruction-following) exhibit rank inversions and near-zero correlations. Because generation is controlled, these inconsistencies reflect how judge scoring behaves differently across languages rather than true model differences. This controlled design provides a diagnostic probe: evaluation methods that fail to maintain stability under identical generation conditions signal transfer failure before deployment. Our findings suggest that zero-shot judge transfer is unreliable for discourse-level assessment in morphologically rich languages, motivating language-specific calibration against targeted human baselines. We release our controlled generation protocol, synthetic data, and evaluation framework to enable replication across language families at https://github.com/isaac-chung/cross-lingual-stability-judges.
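The core diagnostic described above amounts to comparing model rankings produced by the same judge across languages under identical generation conditions. As a minimal sketch of that comparison (not the authors' released framework; the score matrices, language keys, and values below are hypothetical placeholders), cross-language ranking stability could be quantified with rank correlations such as Kendall's tau and Spearman's rho:

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

# Hypothetical judge scores: rows = candidate models, columns = dialogues,
# one matrix per target language. All values are illustrative placeholders.
scores = {
    "et": np.array([[4.2, 3.9, 4.5], [3.8, 4.1, 4.0], [4.6, 4.4, 4.7]]),
    "fi": np.array([[4.0, 4.2, 4.1], [4.3, 4.5, 4.4], [3.7, 3.9, 3.8]]),
    "hu": np.array([[3.9, 4.0, 4.2], [4.1, 4.3, 4.0], [4.4, 4.2, 4.5]]),
}

def model_ranking(score_matrix: np.ndarray) -> np.ndarray:
    """Mean judge score per model; higher is better."""
    return score_matrix.mean(axis=1)

# Pairwise rank correlations between languages: values near 1 indicate a
# stable ranking under identical generation conditions; near-zero or negative
# values correspond to the rank inversions the abstract describes.
langs = list(scores)
for i, a in enumerate(langs):
    for b in langs[i + 1:]:
        tau, _ = kendalltau(model_ranking(scores[a]), model_ranking(scores[b]))
        rho, _ = spearmanr(model_ranking(scores[a]), model_ranking(scores[b]))
        print(f"{a} vs {b}: Kendall tau={tau:.2f}, Spearman rho={rho:.2f}")
```

In this framing, a metric "transfers" if its induced ranking is highly correlated across all language pairs; the paper's finding is that surface-level metrics pass this check while discourse-level judge scores do not.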