Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization

Abstract

Superalignment, where humans act as weak supervisors of superhuman models, has become an important and widely discussed issue amid the rapid development of Large Language Models (LLMs). Recent work has taken a preliminary look at this problem by using weak models to supervise strong models. It finds that weakly supervised strong students can consistently outperform their weak teachers on the alignment target, a phenomenon called weak-to-strong generalization. However, we ask whether this promising phenomenon conceals an issue of weak-to-strong deception, in which strong models deceive weak models by appearing well aligned in areas known to the weak models while producing misaligned behaviors in cases the weak models do not know. We take an initial step towards exploring this security issue in a specific but realistic multi-objective alignment case, where some alignment targets may conflict with each other (e.g., helpfulness vs. harmlessness). Such a conflict is likely to cause strong models to deceive weak models in one alignment dimension in order to gain high reward in another. Our experiments on both the reward modeling task and the preference optimization scenario indicate that (1) weak-to-strong deception exists, and (2) the deception may intensify as the capability gap between weak and strong models increases. We also discuss potential solutions and find that bootstrapping with an intermediate model can mitigate the deception to some extent. Our work highlights the urgent need to pay more attention to the true reliability of superalignment.

