Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese

Abstract

While the capabilities of Large Language Models (LLMs) have been studied in both Simplified and Traditional Chinese, it remains unclear whether LLMs perform differently when prompted in these two variants of written Chinese. This understanding is critical, as disparities in the quality of LLM responses can perpetuate representational harms by ignoring the different cultural contexts underlying Simplified versus Traditional Chinese, and can exacerbate downstream harms in LLM-facilitated decision-making in domains such as education or hiring. To investigate potential LLM performance disparities, we design two benchmark tasks that reflect real-world scenarios: regional term choice (prompting the LLM to name a described item that is referred to differently in Mainland China and Taiwan), and regional name choice (prompting the LLM to choose whom to hire from a list of names in both Simplified and Traditional Chinese). For both tasks, we audit the performance of 11 leading commercial LLM services and open-source models, spanning those primarily trained on English, Simplified Chinese, or Traditional Chinese. Our analyses indicate that biases in LLM responses depend on both the task and the prompting language: while most LLMs disproportionately favored Simplified Chinese responses in the regional term choice task, they surprisingly favored Traditional Chinese names in the regional name choice task. We find that these disparities may arise from differences in training data representation, written character preferences, and tokenization of Simplified and Traditional Chinese. These findings highlight the need for further analysis of LLM biases; as such, we provide an open-source benchmark dataset to foster reproducible evaluations of future LLM behavior across Chinese language variants (https://github.com/brucelyu17/SC-TC-Bench).
