Societal Alignment Frameworks Can Improve LLM Alignment

Abstract

Recent progress in large language models (LLMs) has focused on producing responses that meet human expectations and align with shared values, a process termed alignment. However, aligning LLMs remains challenging because of the inherent disconnect between the complexity of human values and the narrow nature of the technological approaches designed to address them. Current alignment methods often lead to misspecified objectives, reflecting the broader issue of incomplete contracts: it is impractical to specify a contract between a model developer and the model that accounts for every scenario in LLM alignment. In this paper, we argue that improving LLM alignment requires incorporating insights from societal alignment frameworks, including social, economic, and contractual alignment, and we discuss potential solutions drawn from these domains. Given the role of uncertainty within societal alignment frameworks, we then investigate how it manifests in LLM alignment. We end our discussion by offering an alternative view of LLM alignment, framing the underspecified nature of its objectives as an opportunity rather than seeking to perfect their specification. Beyond technical improvements in LLM alignment, we also discuss the need for participatory alignment interface designs.