Great Models Think Alike and this Undermines AI Oversight

February 6, 2025
Authors: Shashwat Goel, Joschka Struber, Ilze Amanda Auzina, Karuna K Chandra, Ponnurangam Kumaraguru, Douwe Kiela, Ameya Prabhu, Matthias Bethge, Jonas Geiping
cs.AI

Abstract

As Language Model (LM) capabilities advance, evaluating and supervising them at scale is getting harder for humans. There is hope that other language models can automate both these tasks, which we refer to as "AI Oversight". We study how model similarity affects both aspects of AI oversight by proposing a probabilistic metric for LM similarity based on overlap in model mistakes. Using this metric, we first show that LLM-as-a-judge scores favor models similar to the judge, generalizing recent self-preference results. Then, we study training on LM annotations, and find complementary knowledge between the weak supervisor and strong student model plays a crucial role in gains from "weak-to-strong generalization". As model capabilities increase, it becomes harder to find their mistakes, and we might defer more to AI oversight. However, we observe a concerning trend -- model mistakes are becoming more similar with increasing capabilities, pointing to risks from correlated failures. Our work underscores the importance of reporting and correcting for model similarity, especially in the emerging paradigm of AI oversight.
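
The abstract's key methodological idea is a probabilistic similarity metric between language models based on how much their mistakes overlap. As a rough illustration only, and not the paper's actual metric, the Python sketch below computes a simple Jaccard-style overlap of wrong answers between two models on a labeled benchmark; the function name `mistake_overlap` and the toy data are assumptions introduced here for clarity.

```python
import numpy as np

def mistake_overlap(preds_a, preds_b, labels):
    """Illustrative similarity between two models via overlapping mistakes.

    A simple Jaccard-style ratio, not the paper's probabilistic,
    chance-adjusted metric: among examples where at least one model is
    wrong, count how often both are wrong with the *same* wrong answer.
    """
    preds_a, preds_b, labels = map(np.asarray, (preds_a, preds_b, labels))
    err_a = preds_a != labels
    err_b = preds_b != labels
    either_wrong = (err_a | err_b).sum()
    if either_wrong == 0:
        return 0.0  # neither model makes any mistakes on this benchmark
    shared_wrong = (err_a & err_b & (preds_a == preds_b)).sum()
    return shared_wrong / either_wrong

# Toy example (hypothetical multiple-choice answers encoded as integers):
labels  = [0, 1, 2, 1, 0]
model_a = [0, 1, 1, 1, 2]   # wrong on items 2 and 4
model_b = [0, 2, 1, 1, 2]   # wrong on items 1, 2 and 4
print(mistake_overlap(model_a, model_b, labels))  # ~0.67: same wrong answer on 2 of 3 mistaken items
```

Under this toy notion of similarity, higher values indicate more correlated failures, which is the trend the abstract flags as a risk for AI oversight.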
