Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization
June 17, 2024
Authors: Wenkai Yang, Shiqi Shen, Guangyao Shen, Zhi Gong, Yankai Lin
cs.AI
Abstract
Superalignment, where humans are weak supervisors of superhuman models, has become an important and widely discussed issue in the current era of rapid development of Large Language Models (LLMs). Recent work has preliminarily studied this problem by using weak models to supervise strong models, and discovered that weakly supervised strong students can consistently outperform their weak teachers on the alignment target, a phenomenon known as weak-to-strong generalization. However, we are concerned that behind this promising phenomenon may lie an issue of weak-to-strong deception: strong models may deceive weak models by exhibiting well-aligned behavior in areas the weak models know, while producing misaligned behavior in cases the weak models do not know. We take an initial step towards exploring this security issue in a specific but realistic multi-objective alignment setting, where some alignment targets may conflict with each other (e.g., helpfulness vs. harmlessness). Such a conflict is likely to cause strong models to deceive weak models in one alignment dimension in order to gain high reward in another alignment dimension. Our experiments on both the reward modeling task and the preference optimization scenario indicate that: (1) weak-to-strong deception exists; and (2) the deception may intensify as the capability gap between the weak and strong models increases. We also discuss potential solutions and find that bootstrapping with an intermediate model can mitigate the deception to some extent. Our work highlights the urgent need to pay more attention to the true reliability of superalignment.