Consistency Training for Reducing Political Manipulation

Abstract

Large language models (LLMs) exhibit systematic political bias across a variety of sensitive contexts. We find that LLMs handle counterpart topics from opposing political sides asymmetrically. We refer to this phenomenon as covert political bias and identify seven categories of techniques through which it operates. We propose two metrics for covert bias: Sentiment Consistency measures the symmetry of rhetoric and framing across paired political prompts, and Helpfulness Consistency measures the symmetry of depth and engagement. To reduce both types of covert bias, we introduce Political Consistency Training (PCT), a reinforcement learning (RL) training method with two complementary paradigms: Sentiment Consistency Training and Helpfulness Consistency Training. We show that PCT preserves overall helpfulness, substantially reduces covert political bias, and generalizes to held-out benchmarks. We release our work at https://political-manipulation.ai.