**Position: The Complexity of Perfect AI Alignment -- Formalizing the RLHF Trilemma**
November 23, 2025
Authors: Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary
cs.AI
Abstract
Reinforcement Learning from Human Feedback (RLHF) is widely used for aligning large language models, yet practitioners face a persistent puzzle: improving safety often reduces fairness, scaling to diverse populations becomes computationally intractable, and making systems robust often amplifies majority biases. We formalize this tension as the Alignment Trilemma: no RLHF system can simultaneously achieve (i) epsilon-representativeness across diverse human values, (ii) polynomial tractability in sample and compute complexity, and (iii) delta-robustness against adversarial perturbations and distribution shift. Through a complexity-theoretic analysis integrating statistical learning theory and robust optimization, we prove that achieving both representativeness (epsilon <= 0.01) and robustness (delta <= 0.001) for global-scale populations requires Omega(2^{d_context}) operations, which is super-polynomial in the context dimensionality. We show that current RLHF implementations resolve this trilemma by sacrificing representativeness: they collect only 10^3--10^4 samples from homogeneous annotator pools while 10^7--10^8 samples are needed for true global representation. Our framework provides a unified explanation for documented RLHF pathologies including preference collapse, sycophancy, and systematic bias amplification. We conclude with concrete directions for navigating these fundamental trade-offs through strategic relaxations of alignment requirements.
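As a rough illustration of the sample-complexity gap the abstract cites (10^3--10^4 samples from homogeneous annotator pools versus 10^7--10^8 for global representation), the sketch below applies a standard Hoeffding-style estimate to a stratified annotator pool. This is not the paper's derivation; the subgroup counts, tolerances, and the bound itself are assumptions chosen only to show how the orders of magnitude can arise.

```python
# Illustrative back-of-the-envelope calculation (assumption, not the paper's proof):
# estimate each subgroup's preference rate to within eps with overall failure
# probability delta, using a Hoeffding bound plus a union bound over subgroups.
import math

def pac_samples(eps: float, delta: float, groups: int) -> int:
    """Total samples: per-subgroup Hoeffding count, summed over all subgroups."""
    per_group = math.ceil((math.log(groups) + math.log(1.0 / delta)) / (2 * eps ** 2))
    return groups * per_group

# Homogeneous annotator pool: few subgroups, loose tolerance.
print(pac_samples(eps=0.1, delta=0.05, groups=5))        # ~1.2e3
# Globally representative pool: many subgroups, tight tolerance (eps <= 0.01).
print(pac_samples(eps=0.01, delta=0.001, groups=1_000))  # ~7e7
```

Under these assumed parameters the two regimes land near 10^3 and 10^7--10^8 samples respectively, consistent with the range quoted in the abstract; the paper's actual Omega(2^{d_context}) lower bound comes from its complexity-theoretic analysis, not from this simple estimate.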