ProgressGym: Alignment with a Millennium of Moral Progress
June 28, 2024
Authors: Tianyi Qiu, Yang Zhang, Xuchuan Huang, Jasmine Xinze Li, Jiaming Ji, Yaodong Yang
cs.AI
Abstract
Frontier AI systems, including large language models (LLMs), hold increasing
influence over the epistemology of human users. Such influence can reinforce
prevailing societal values, potentially contributing to the lock-in of
misguided moral beliefs and, consequently, the perpetuation of problematic
moral practices on a broad scale. We introduce progress alignment as a
technical solution to mitigate this imminent risk. Progress alignment
algorithms learn to emulate the mechanics of human moral progress, thereby
addressing the susceptibility of existing alignment methods to contemporary
moral blind spots. To empower research in progress alignment, we introduce
ProgressGym, an experimental framework allowing the learning of moral progress
mechanics from history, in order to facilitate future progress in real-world
moral decisions. Leveraging 9 centuries of historical text and 18 historical
LLMs, ProgressGym enables codification of real-world progress alignment
challenges into concrete benchmarks. Specifically, we introduce three core
challenges: tracking evolving values (PG-Follow), preemptively anticipating
moral progress (PG-Predict), and regulating the feedback loop between human and
AI value shifts (PG-Coevolve). Alignment methods without a temporal dimension
are inapplicable to these tasks. In response, we present lifelong and
extrapolative algorithms as baseline methods of progress alignment, and build
an open leaderboard soliciting novel algorithms and challenges. The framework
and the leaderboard are available at
https://github.com/PKU-Alignment/ProgressGym and
https://huggingface.co/spaces/PKU-Alignment/ProgressGym-LeaderBoard
respectively.