Debate Helps Weak-to-Strong Generalization

January 21, 2025
Authors: Hao Lang, Fei Huang, Yongbin Li
cs.AI

Abstract

Common methods for aligning already-capable models with desired behavior rely on the ability of humans to provide supervision. However, future superhuman models will surpass human capabilities, so humans will only be able to supervise them weakly. This expected deficiency of human evaluation would weaken the safety of future AI systems. Scalable oversight and weak-to-strong generalization are two complementary approaches to this issue. In this paper, we attempt to combine the strengths of the two approaches to further improve alignment. Specifically, we investigate ways of improving human supervision with a strong pretrained model and then supervising the strong model with the enhanced weak human supervision. To make iterative empirical progress, we consider an analogy: can we use a strong model to improve weak model supervision and then use that supervision to train the strong model? We test this empirically by fine-tuning a small weak model on ground-truth labels with additional help from a large strong model, and then fine-tuning the strong model on labels generated by the weak model. We find that debate can help a weak model extract trustworthy information from an untrustworthy strong model, and that this information provides leverage as context on samples when training the weak model. We also show that an ensemble of weak models helps exploit the long arguments generated by strong-model debaters and yields a more robust supervision estimate. Extensive experiments on the OpenAI weak-to-strong NLP benchmarks show that the combined approach leads to better alignment, which indicates that debate has the potential to help weak-to-strong generalization.
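
The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `run_debate`, `ensemble_label`, and `build_weak_labels` are hypothetical, the debaters and weak models are plain callables standing in for fine-tuned language models, and both the two-sided debate format and the simple averaging of weak-model outputs are assumptions rather than details stated in the abstract.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-ins for real models (assumptions, not the paper's API):
#   Debater:   (question, transcript_so_far, side) -> argument text
#   WeakModel: (question, transcript) -> probability that the label is 1
Debater = Callable[[str, str, int], str]
WeakModel = Callable[[str, str], float]


def run_debate(question: str, debaters: Tuple[Debater, Debater], rounds: int = 2) -> str:
    """Collect alternating arguments from two strong-model debaters."""
    transcript = ""
    for _ in range(rounds):
        for side, debater in enumerate(debaters):
            argument = debater(question, transcript, side)
            transcript += f"[debater {side}] {argument}\n"
    return transcript


def ensemble_label(question: str, transcript: str, weak_models: Sequence[WeakModel]) -> float:
    """Average the weak models' probabilities into one soft label (assumed aggregation)."""
    return mean(model(question, transcript) for model in weak_models)


def build_weak_labels(
    questions: Sequence[str],
    debaters: Tuple[Debater, Debater],
    weak_models: Sequence[WeakModel],
    rounds: int = 2,
) -> List[Tuple[str, float]]:
    """Produce (question, soft_label) pairs on which a strong model could be fine-tuned."""
    labeled = []
    for question in questions:
        transcript = run_debate(question, debaters, rounds)
        labeled.append((question, ensemble_label(question, transcript, weak_models)))
    return labeled
```

In this sketch the debate transcript is passed to each weak model as extra context, and the final step (fine-tuning the strong model on the resulting pairs) is left to whatever training loop is in use.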
