Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese

Abstract

While the capabilities of Large Language Models (LLMs) have been studied in both Simplified and Traditional Chinese, it remains unclear whether LLMs exhibit differential performance when prompted in these two variants of written Chinese. This understanding is critical, as disparities in the quality of LLM responses can perpetuate representational harms by ignoring the different cultural contexts underlying Simplified versus Traditional Chinese, and can exacerbate downstream harms in LLM-facilitated decision-making in domains such as education or hiring. To investigate potential LLM performance disparities, we design two benchmark tasks that reflect real-world scenarios: regional term choice (prompting the LLM to name a described item that is referred to differently in Mainland China and Taiwan), and regional name choice (prompting the LLM to choose whom to hire from a list of names in both Simplified and Traditional Chinese). For both tasks, we audit the performance of 11 leading commercial LLM services and open-source models, spanning those primarily trained on English, Simplified Chinese, or Traditional Chinese. Our analyses indicate that biases in LLM responses depend on both the task and the prompting language: while most LLMs disproportionately favored Simplified Chinese responses in the regional term choice task, they surprisingly favored Traditional Chinese names in the regional name choice task. We find that these disparities may arise from differences in training data representation, written character preferences, and tokenization of Simplified and Traditional Chinese. These findings highlight the need for further analysis of LLM biases; as such, we provide an open-source benchmark dataset to foster reproducible evaluations of future LLM behavior across Chinese language variants (https://github.com/brucelyu17/SC-TC-Bench).
