ProgressGym: Alignment with a Millennium of Moral Progress

Abstract

Frontier AI systems, including large language models (LLMs), hold increasing influence over the epistemology of human users. Such influence can reinforce prevailing societal values, potentially contributing to the lock-in of misguided moral beliefs and, consequently, the perpetuation of problematic moral practices on a broad scale. We introduce progress alignment as a technical solution to mitigate this imminent risk. Progress alignment algorithms learn to emulate the mechanics of human moral progress, thereby addressing the susceptibility of existing alignment methods to contemporary moral blind spots. To empower research in progress alignment, we introduce ProgressGym, an experimental framework that allows the mechanics of moral progress to be learned from history, in order to facilitate future progress in real-world moral decisions. Leveraging 9 centuries of historical text and 18 historical LLMs, ProgressGym enables codification of real-world progress alignment challenges into concrete benchmarks. Specifically, we introduce three core challenges: tracking evolving values (PG-Follow), preemptively anticipating moral progress (PG-Predict), and regulating the feedback loop between human and AI value shifts (PG-Coevolve). Alignment methods without a temporal dimension are inapplicable to these tasks. In response, we present lifelong and extrapolative algorithms as baseline methods of progress alignment, and build an open leaderboard soliciting novel algorithms and challenges. The framework and the leaderboard are available at https://github.com/PKU-Alignment/ProgressGym and https://huggingface.co/spaces/PKU-Alignment/ProgressGym-LeaderBoard, respectively.
