Reducing Political Manipulation with Consistency Training

May 28, 2026
Authors: Long Phan, Devin Kim, Alexander Pan, Alice Blair, Adam Khoja, Dan Hendrycks
cs.AI

Abstract

Large language models (LLMs) exhibit systematic political bias across a variety of sensitive contexts. We find that LLMs handle counterpart topics from opposing political sides asymmetrically. We refer to this phenomenon as covert political bias and identify 7 categories of techniques through which it operates. We propose two metrics for covert bias: Sentiment Consistency measures symmetry in rhetoric and framing across paired political prompts; Helpfulness Consistency measures symmetry in the depth and engagement of responses. To reduce both types of covert bias, we introduce Political Consistency Training (PCT), an RL training method with two complementary paradigms: Sentiment Consistency Training and Helpfulness Consistency Training. We show that PCT preserves overall helpfulness, substantially reduces covert political bias, and generalizes to held-out benchmarks. We release our work at https://political-manipulation.ai
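
The abstract does not give formulas for the two metrics, so the following is only an illustrative sketch of what a pairwise consistency score could look like, not the authors' definition. It assumes each response in a pair of counterpart prompts has already been assigned a scalar score in [-1, 1] (for example, a sentiment or helpfulness rating from some judge); the function names `pair_consistency` and `consistency`, the score range, and the absolute-gap formula are all hypothetical choices made for this example.

```python
from statistics import mean
from typing import Iterable, Tuple


def pair_consistency(score_a: float, score_b: float, scale: float = 2.0) -> float:
    """Return 1.0 when the two scores match and 0.0 at the largest possible gap.

    Assumption: both scores lie in [-1, 1], so the largest gap is 2.0 (`scale`).
    """
    return 1.0 - abs(score_a - score_b) / scale


def consistency(pairs: Iterable[Tuple[float, float]]) -> float:
    """Average pairwise consistency over paired prompts (hypothetical metric)."""
    return mean(pair_consistency(a, b) for a, b in pairs)


# Each tuple holds the scores of the responses to two counterpart prompts,
# one from each political side. The values are made up for illustration.
example_pairs = [(0.5, 0.5), (1.0, -1.0)]
print(consistency(example_pairs))
```

Under these assumptions, a model that treats both sides identically scores 1.0, and larger asymmetries pull the score toward 0.0. A training signal in the spirit of PCT could plausibly reward higher values of such a score, but the actual reward design is not specified in the abstract.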