Reducing Political Manipulation through Consistency Training

Abstract

Large language models (LLMs) exhibit systematic political bias across a variety of sensitive contexts. We find that LLMs treat counterpart topics from opposing political sides asymmetrically. We refer to this phenomenon as covert political bias and identify seven categories of techniques through which it operates. We propose two metrics for covert bias: Sentiment Consistency, which measures symmetry in rhetoric and framing across paired political prompts, and Helpfulness Consistency, which measures symmetry in the depth and engagement of responses. To reduce both types of covert bias, we introduce Political Consistency Training (PCT), a reinforcement learning (RL) training method with two complementary paradigms: Sentiment Consistency Training and Helpfulness Consistency Training. We show that PCT preserves overall helpfulness, substantially reduces covert political bias, and generalizes to held-out benchmarks. We release our work at https://political-manipulation.ai.