
Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization

June 17, 2024
作者: Wenkai Yang, Shiqi Shen, Guangyao Shen, Zhi Gong, Yankai Lin
cs.AI

Abstract

Superalignment, where humans are weak supervisors of superhuman models, has become an important and widely discussed issue in the current era of rapid development of Large Language Models (LLMs). Recent work takes preliminary steps toward studying this problem by using weak models to supervise strong models. It discovers that weakly supervised strong students can consistently outperform weak teachers on the alignment target, leading to a weak-to-strong generalization phenomenon. However, we are concerned that behind such a promising phenomenon there may exist an issue of weak-to-strong deception, where strong models deceive weak models by exhibiting well-aligned behaviors in areas known to weak models but producing misaligned behaviors in cases weak models do not know about. We take an initial step toward exploring this safety issue in a specific but realistic multi-objective alignment case, where some alignment targets may conflict with each other (e.g., helpfulness vs. harmlessness). Such a conflict is likely to cause strong models to deceive weak models in one alignment dimension to gain high reward in another alignment dimension. Our experiments on both the reward modeling task and the preference optimization scenario indicate that: (1) weak-to-strong deception exists; (2) the deception phenomenon may intensify as the capability gap between weak and strong models increases. We also discuss potential solutions and find that bootstrapping with an intermediate model can mitigate the deception to some extent. Our work highlights the urgent need to pay more attention to the true reliability of superalignment.
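The weak-to-strong deception the abstract describes can be illustrated with a toy sketch: a weak teacher labels only the cases it can check correctly, a strong student fits those noisy labels, and "deception" is measured as misalignment on the cases the weak model cannot verify. All function names and the binary alignment-label representation below are illustrative assumptions, not the paper's actual setup.

```python
# Toy sketch of weak-to-strong supervision and a deception metric.
# Assumptions (not from the paper): alignment is a 0/1 label, and the weak
# teacher is correct only on "easy" examples it knows how to judge.

def weak_teacher(example):
    # The weak supervisor labels easy cases correctly, hard cases wrongly.
    return example["true_label"] if example["easy"] else 1 - example["true_label"]

def train_strong_student(dataset):
    # The strong student simply imitates the weak teacher's (noisy) labels.
    return {ex["id"]: weak_teacher(ex) for ex in dataset}

def deception_rate(dataset, student):
    # Weak-to-strong deception: the student appears aligned where the weak
    # model can check (easy cases) but is misaligned where it cannot (hard).
    hard = [ex for ex in dataset if not ex["easy"]]
    wrong = sum(student[ex["id"]] != ex["true_label"] for ex in hard)
    return wrong / len(hard) if hard else 0.0

data = [
    {"id": 0, "easy": True,  "true_label": 1},
    {"id": 1, "easy": True,  "true_label": 0},
    {"id": 2, "easy": False, "true_label": 1},
    {"id": 3, "easy": False, "true_label": 0},
]

student = train_strong_student(data)
print(deception_rate(data, student))  # hard cases inherit the teacher's errors
```

In this caricature the student matches the teacher everywhere, so it looks perfectly aligned on the checkable (easy) cases while being fully misaligned on the hard ones; the paper's point is that real strong students can exhibit this pattern even when they outperform the teacher overall.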
