

Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese

May 28, 2025
作者: Hanjia Lyu, Jiebo Luo, Jian Kang, Allison Koenecke
cs.AI

Abstract
While the capabilities of Large Language Models (LLMs) have been studied in both Simplified and Traditional Chinese, it is yet unclear whether LLMs exhibit differential performance when prompted in these two variants of written Chinese. This understanding is critical, as disparities in the quality of LLM responses can perpetuate representational harms by ignoring the different cultural contexts underlying Simplified versus Traditional Chinese, and can exacerbate downstream harms in LLM-facilitated decision-making in domains such as education or hiring. To investigate potential LLM performance disparities, we design two benchmark tasks that reflect real-world scenarios: regional term choice (prompting the LLM to name a described item which is referred to differently in Mainland China and Taiwan), and regional name choice (prompting the LLM to choose who to hire from a list of names in both Simplified and Traditional Chinese). For both tasks, we audit the performance of 11 leading commercial LLM services and open-sourced models -- spanning those primarily trained on English, Simplified Chinese, or Traditional Chinese. Our analyses indicate that biases in LLM responses are dependent on both the task and prompting language: while most LLMs disproportionately favored Simplified Chinese responses in the regional term choice task, they surprisingly favored Traditional Chinese names in the regional name choice task. We find that these disparities may arise from differences in training data representation, written character preferences, and tokenization of Simplified and Traditional Chinese. These findings highlight the need for further analysis of LLM biases; as such, we provide an open-sourced benchmark dataset to foster reproducible evaluations of future LLM behavior across Chinese language variants (https://github.com/brucelyu17/SC-TC-Bench).
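The tokenization hypothesis above can be illustrated with a minimal sketch (the example word pairs below are hypothetical illustrations, not drawn from the paper's SC-TC-Bench dataset): modern byte-level tokenizers operate on UTF-8 bytes, and the Simplified and Traditional forms of the same word encode to entirely different byte sequences, so a model's tokenizer and training-data statistics can treat the two variants very differently.

```python
# Minimal illustration (hypothetical term pairs, not from SC-TC-Bench):
# the same concept written in Simplified vs. Traditional Chinese yields
# distinct UTF-8 byte sequences, so byte-level tokenizers see the two
# variants as unrelated inputs.

pairs = {
    "software": ("软件", "軟體"),  # also differs as a regional term (Mainland vs. Taiwan)
    "bread":    ("面包", "麵包"),
    "network":  ("网络", "網路"),
}

for gloss, (simplified, traditional) in pairs.items():
    sc_bytes = simplified.encode("utf-8")
    tc_bytes = traditional.encode("utf-8")
    print(f"{gloss}: SC={simplified} ({len(sc_bytes)} bytes), "
          f"TC={traditional} ({len(tc_bytes)} bytes), "
          f"shared bytes: {sc_bytes == tc_bytes}")
```

Although each CJK character occupies the same three bytes in UTF-8, the byte values differ, so the learned token merges and frequencies for the two variants diverge; this is one way the training-data and tokenization disparities the authors describe could surface as differential behavior.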

