ProgressGym: Alignment with a Millennium of Moral Progress

Abstract

Frontier AI systems, including large language models (LLMs), hold increasing influence over the epistemology of human users. Such influence can reinforce prevailing societal values, potentially contributing to the lock-in of misguided moral beliefs and, consequently, the perpetuation of problematic moral practices on a broad scale. We introduce progress alignment as a technical solution to mitigate this imminent risk. Progress alignment algorithms learn to emulate the mechanics of human moral progress, thereby addressing the susceptibility of existing alignment methods to contemporary moral blindspots. To empower research in progress alignment, we introduce ProgressGym, an experimental framework allowing the learning of moral progress mechanics from history, in order to facilitate future progress in real-world moral decisions. Leveraging 9 centuries of historical text and 18 historical LLMs, ProgressGym enables codification of real-world progress alignment challenges into concrete benchmarks. Specifically, we introduce three core challenges: tracking evolving values (PG-Follow), preemptively anticipating moral progress (PG-Predict), and regulating the feedback loop between human and AI value shifts (PG-Coevolve). Alignment methods without a temporal dimension are inapplicable to these tasks. In response, we present lifelong and extrapolative algorithms as baseline methods of progress alignment, and build an open leaderboard soliciting novel algorithms and challenges. The framework and the leaderboard are available at https://github.com/PKU-Alignment/ProgressGym and https://huggingface.co/spaces/PKU-Alignment/ProgressGym-LeaderBoard respectively.
