Cross-Lingual Stability of LLM Judges Under Controlled Generation: Evidence from Finno-Ugric Languages
February 2, 2026
Authors: Isaac Chung, Linda Freienthal
cs.AI
Abstract
Cross-lingual evaluation of large language models (LLMs) typically conflates two sources of variance: genuine model performance differences and measurement instability. We investigate evaluation reliability by holding generation conditions constant while varying target language. Using synthetic customer-support dialogues generated with identical parameters across Estonian, Finnish, and Hungarian, we test whether automatic metrics and LLM-as-a-judge scoring produce stable model rankings across these morphologically rich, related Finno-Ugric languages. With a small set of Estonian native speaker annotations as a reference point, we find systematic ranking instabilities: surface-level metrics (lexical diversity, surface and semantic similarity) maintain cross-language stability, but pragmatic judgments (coherence, instruction-following) exhibit rank inversions and near-zero correlations. Because generation is controlled, these inconsistencies reflect how judge scoring behaves differently across languages rather than true model differences.
This controlled design provides a diagnostic probe: evaluation methods that fail to maintain stability under identical generation conditions signal transfer failure before deployment. Our findings suggest that zero-shot judge transfer is unreliable for discourse-level assessment in morphologically rich languages, motivating language-specific calibration against targeted human baselines. We release our controlled generation protocol, synthetic data, and evaluation framework to enable replication across language families at https://github.com/isaac-chung/cross-lingual-stability-judges.
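The rank-stability check described above can be sketched with a Spearman rank correlation between the model rankings a judge produces in each language. This is a minimal stdlib-only illustration with hypothetical judge scores (the model names, scores, and the exact correlation routine are assumptions for demonstration, not the paper's actual data or pipeline):

```python
from statistics import mean

def ranks(scores):
    """Assign average ranks (1-based), handling ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    r = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # group tied scores and give them their average rank
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(a, b):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = mean(ra), mean(rb)
    num = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    den = (sum((x - ma) ** 2 for x in ra) *
           sum((y - mb) ** 2 for y in rb)) ** 0.5
    return num / den

# Hypothetical judge coherence scores for four models under
# identical generation conditions, one list per target language.
et = [4.1, 3.2, 2.8, 3.9]  # Estonian
fi = [2.9, 4.0, 3.8, 3.1]  # Finnish: the ranking is largely inverted

print(spearman(et, fi))  # strongly negative -> rank inversion, not stability
```

A rho near +1 across language pairs would indicate the judge transfers its ranking; values near zero or negative, as in this toy example, are the instability signal the paper's diagnostic is designed to surface.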