
**Position: The Complexity of Perfect AI Alignment -- Formalizing the RLHF Trilemma**

November 23, 2025
Authors: Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary
cs.AI

Abstract

Reinforcement Learning from Human Feedback (RLHF) is widely used for aligning large language models, yet practitioners face a persistent puzzle: improving safety often reduces fairness, scaling to diverse populations becomes computationally intractable, and making systems robust often amplifies majority biases. We formalize this tension as the Alignment Trilemma: no RLHF system can simultaneously achieve (i) epsilon-representativeness across diverse human values, (ii) polynomial tractability in sample and compute complexity, and (iii) delta-robustness against adversarial perturbations and distribution shift. Through a complexity-theoretic analysis integrating statistical learning theory and robust optimization, we prove that achieving both representativeness (epsilon <= 0.01) and robustness (delta <= 0.001) for global-scale populations requires Omega(2^{d_context}) operations, which is super-polynomial in the context dimensionality. We show that current RLHF implementations resolve this trilemma by sacrificing representativeness: they collect only 10^3--10^4 samples from homogeneous annotator pools while 10^7--10^8 samples are needed for true global representation. Our framework provides a unified explanation for documented RLHF pathologies including preference collapse, sycophancy, and systematic bias amplification. We conclude with concrete directions for navigating these fundamental trade-offs through strategic relaxations of alignment requirements.
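To make the abstract's orders of magnitude concrete, here is a minimal back-of-the-envelope sketch, not the paper's actual construction: it assumes a Hoeffding-style per-group sample bound and takes the Omega(2^{d_context}) robustness cost at face value. The group count, epsilon targets, and failure probability below are illustrative assumptions.

```python
import math

def illustrative_sample_bound(num_groups: int, epsilon: float, failure_prob: float = 0.05) -> int:
    """Illustrative uniform-convergence estimate for epsilon-representativeness.

    Assumes preference estimation within each of `num_groups` subpopulations must be
    epsilon-accurate with probability 1 - failure_prob, using a Hoeffding-style bound
    n_per_group ~ ln(2 * num_groups / failure_prob) / (2 * epsilon^2).
    The constants and functional form are assumptions, not the paper's theorem.
    """
    per_group = math.log(2 * num_groups / failure_prob) / (2 * epsilon ** 2)
    return math.ceil(num_groups * per_group)

def illustrative_robust_ops(d_context: int) -> float:
    """Worst-case operation count if delta-robustness certification must enumerate a
    2^{d_context}-sized perturbation set (the Omega(2^{d_context}) scaling cited in
    the abstract, taken at face value)."""
    return 2.0 ** d_context

if __name__ == "__main__":
    # Homogeneous annotator pool vs. a coarse global stratification (group counts assumed).
    print(illustrative_sample_bound(num_groups=1, epsilon=0.05))    # ~7e2, the 10^3-10^4 regime
    print(illustrative_sample_bound(num_groups=200, epsilon=0.01))  # ~9e6, the 10^7-10^8 regime
    print(f"{illustrative_robust_ops(64):.2e}")                     # ~1.8e19 operations
```

Under these assumed parameters the gap between a single homogeneous pool and a stratified global population spans roughly four orders of magnitude in samples, while the exponential robustness term dominates compute, which is the trade-off the trilemma formalizes.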