Cross-Lingual Stability of LLM Judges Under Controlled Generation: Evidence from Finno-Ugric Languages
February 2, 2026
Authors: Isaac Chung, Linda Freienthal
cs.AI
Abstract
Cross-lingual evaluation of large language models (LLMs) typically conflates two sources of variance: genuine model performance differences and measurement instability. We investigate evaluation reliability by holding generation conditions constant while varying target language. Using synthetic customer-support dialogues generated with identical parameters across Estonian, Finnish, and Hungarian, we test whether automatic metrics and LLM-as-a-judge scoring produce stable model rankings across these morphologically rich, related Finno-Ugric languages. With a small set of Estonian native speaker annotations as a reference point, we find systematic ranking instabilities: surface-level metrics (lexical diversity, surface and semantic similarity) maintain cross-language stability, but pragmatic judgments (coherence, instruction-following) exhibit rank inversions and near-zero correlations. Because generation is controlled, these inconsistencies reflect how judge scoring behaves differently across languages rather than true model differences.
This controlled design provides a diagnostic probe: evaluation methods that fail to maintain stability under identical generation conditions signal transfer failure before deployment. Our findings suggest that zero-shot judge transfer is unreliable for discourse-level assessment in morphologically rich languages, motivating language-specific calibration against targeted human baselines. We release our controlled generation protocol, synthetic data, and evaluation framework to enable replication across language families at https://github.com/isaac-chung/cross-lingual-stability-judges.
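The rank-stability check described above can be sketched with a Spearman rank correlation between the model rankings a judge produces in each language. This is a minimal stdlib-only illustration with hypothetical judge scores (the model names, scores, and the exact correlation routine are assumptions for demonstration, not the paper's actual data or pipeline):

```python
from statistics import mean

def ranks(scores):
    """Assign average ranks (1-based), handling ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    r = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # group tied scores and give them their average rank
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(a, b):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = mean(ra), mean(rb)
    num = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    den = (sum((x - ma) ** 2 for x in ra) *
           sum((y - mb) ** 2 for y in rb)) ** 0.5
    return num / den

# Hypothetical judge coherence scores for four models under
# identical generation conditions, one list per target language.
et = [4.1, 3.2, 2.8, 3.9]  # Estonian
fi = [2.9, 4.0, 3.8, 3.1]  # Finnish: the ranking is largely inverted

print(spearman(et, fi))  # strongly negative -> rank inversion, not stability
```

A rho near +1 across language pairs would indicate the judge transfers its ranking; values near zero or negative, as in this toy example, are the instability signal the paper's diagnostic is designed to surface.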