Reducing Political Manipulation with Consistency Training

Abstract

Large language models (LLMs) exhibit systematic political bias across a variety of sensitive contexts. We find that LLMs handle counterpart topics from opposing political sides asymmetrically. We refer to this phenomenon as covert political bias and identify seven categories of techniques through which it operates. We propose two metrics for covert bias: Sentiment Consistency, which measures symmetry in rhetoric and framing across paired prompts from opposing political sides, and Helpfulness Consistency, which measures symmetry in the depth and engagement of responses. To reduce both types of covert bias, we introduce Political Consistency Training (PCT), a reinforcement learning (RL) training method that combines two complementary paradigms: Sentiment Consistency Training and Helpfulness Consistency Training. We show that PCT preserves overall helpfulness, substantially reduces covert political bias, and generalizes to held-out benchmarks. We release our work at https://political-manipulation.ai