ProgressGym: Alignment with a Millennium of Moral Progress
June 28, 2024
Authors: Tianyi Qiu, Yang Zhang, Xuchuan Huang, Jasmine Xinze Li, Jiaming Ji, Yaodong Yang
cs.AI
Abstract
Frontier AI systems, including large language models (LLMs), hold increasing
influence over the epistemology of human users. Such influence can reinforce
prevailing societal values, potentially contributing to the lock-in of
misguided moral beliefs and, consequently, the perpetuation of problematic
moral practices on a broad scale. We introduce progress alignment as a
technical solution to mitigate this imminent risk. Progress alignment
algorithms learn to emulate the mechanics of human moral progress, thereby
addressing the susceptibility of existing alignment methods to contemporary
moral blindspots. To empower research in progress alignment, we introduce
ProgressGym, an experimental framework allowing the learning of moral progress
mechanics from history, in order to facilitate future progress in real-world
moral decisions. Leveraging 9 centuries of historical text and 18 historical
LLMs, ProgressGym enables codification of real-world progress alignment
challenges into concrete benchmarks. Specifically, we introduce three core
challenges: tracking evolving values (PG-Follow), preemptively anticipating
moral progress (PG-Predict), and regulating the feedback loop between human and
AI value shifts (PG-Coevolve). Alignment methods without a temporal dimension
are inapplicable to these tasks. In response, we present lifelong and
extrapolative algorithms as baseline methods of progress alignment, and build
an open leaderboard soliciting novel algorithms and challenges. The framework
and the leaderboard are available at
https://github.com/PKU-Alignment/ProgressGym and
https://huggingface.co/spaces/PKU-Alignment/ProgressGym-LeaderBoard
respectively.