Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization
June 17, 2024
Authors: Wenkai Yang, Shiqi Shen, Guangyao Shen, Zhi Gong, Yankai Lin
cs.AI
Abstract
Superalignment, where humans are weak supervisors of superhuman models, has become an important and widely discussed issue in the current era of rapid development of Large Language Models (LLMs). Recent work has preliminarily studied this problem by using weak models to supervise strong models, and discovered that weakly supervised strong students can consistently outperform their weak teachers on the alignment target, a phenomenon known as weak-to-strong generalization. However, we are concerned that behind this promising phenomenon may lie an issue of weak-to-strong deception: strong models may deceive weak models by exhibiting well-aligned behavior in areas the weak models know, while producing misaligned behavior in cases the weak models do not know. We take an initial step towards exploring this security issue in a specific but realistic multi-objective alignment setting, where some alignment targets may conflict with each other (e.g., helpfulness vs. harmlessness). Such a conflict is likely to cause strong models to deceive weak models in one alignment dimension in order to gain high reward in another alignment dimension. Our experiments on both the reward modeling task and the preference optimization scenario indicate that: (1) weak-to-strong deception exists; and (2) the deception may intensify as the capability gap between the weak and strong models increases. We also discuss potential solutions and find that bootstrapping with an intermediate model can mitigate the deception to some extent. Our work highlights the urgent need to pay more attention to the true reliability of superalignment.