Reducing Political Manipulation with Consistency Training

May 28, 2026
Authors: Long Phan, Devin Kim, Alexander Pan, Alice Blair, Adam Khoja, Dan Hendrycks
cs.AI

Abstract

Large language models (LLMs) exhibit systematic political bias across a variety of sensitive contexts. We find that LLMs handle counterpart topics from opposing political sides asymmetrically. We refer to this phenomenon as covert political bias and identify seven categories of techniques through which it operates. We propose two metrics for covert bias: Sentiment Consistency measures symmetry in rhetoric and framing across paired political prompts, and Helpfulness Consistency measures symmetry in response depth and engagement. To reduce both types of covert bias, we introduce Political Consistency Training (PCT), a reinforcement learning (RL) training method with two complementary paradigms: Sentiment Consistency Training and Helpfulness Consistency Training. We show that PCT preserves overall helpfulness, substantially reduces covert political bias, and generalizes to held-out benchmarks. We release our work at https://political-manipulation.ai.
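
The abstract does not spell out how the two consistency metrics are computed. As a rough illustration of the general idea of scoring symmetry across paired prompts, the sketch below assumes each response has already been assigned a score in [0, 1] (for example a sentiment or depth rating from some external judge) and reports one minus the mean absolute gap between the two sides of each pair. The function name `pair_consistency`, the score range, and the formula are all assumptions made for this example, not the paper's definitions.

```python
from statistics import mean


def pair_consistency(scores_a, scores_b):
    """Hypothetical symmetry score for paired prompts (not the paper's metric).

    scores_a[i] and scores_b[i] are assumed to be scores in [0, 1] for the
    responses to the two politically mirrored prompts of pair i.
    Returns 1.0 when every pair is scored identically and 0.0 when every
    pair is maximally far apart.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("each prompt needs a counterpart on the other side")
    if not scores_a:
        raise ValueError("at least one prompt pair is required")
    # Absolute gap per pair: 0 means symmetric treatment, 1 means maximal asymmetry.
    gaps = [abs(a - b) for a, b in zip(scores_a, scores_b)]
    return 1.0 - mean(gaps)


if __name__ == "__main__":
    # Made-up scores for two prompt pairs, purely for illustration.
    print(pair_consistency([0.8, 0.5], [0.6, 0.5]))
```

Under these assumptions the same function could serve either metric, depending on whether the input scores rate sentiment or response depth; how the paper actually scores responses and aggregates pairs may differ.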