Trust Functions: Near-Lossless Weak-to-Strong Generalization by Learning When to Trust a Weak Teacher

Abstract

Weak-to-strong generalization studies how to improve a strong student using supervision from a weaker teacher when reliable labels are scarce. We view this primarily as a data selection problem, where the key challenge is to identify which weak labels are reliable enough to serve as a training signal. To address this, we introduce trust functions, which assign each weak label a scalar trust score, and we use these scores to filter the weak supervision. Across several domains, including world knowledge, quantitative reasoning, and strategy games, trust filtering yields students that match, and sometimes surpass, students trained with ground-truth supervision, achieving near-lossless weak-to-strong generalization. Moreover, trust functions enable an iterative weak-to-strong chain: each trained student is reused as the teacher for the next stage, so that gains compound across stages. The advantage of trust functions can be attributed to several mechanisms.
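
The filtering step described above can be illustrated with a minimal sketch. The abstract does not specify how trust scores are computed, so everything here is an assumption made for illustration: the `WeakExample` record, the `max_prob_trust` scorer (which simply uses the teacher's top class probability as a stand-in for a trust function), and the fixed threshold are hypothetical and are not the authors' method.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class WeakExample:
    """One weakly labeled training example (hypothetical structure)."""
    prompt: str
    weak_label: str
    # Assumption: the weak teacher exposes probabilities over candidate labels.
    teacher_probs: Sequence[float]


def max_prob_trust(example: WeakExample) -> float:
    """Stand-in trust function: the teacher's highest class probability.

    This is only a placeholder; any function mapping a weak label to a
    scalar score could be substituted.
    """
    return max(example.teacher_probs)


def trust_filter(
    examples: Sequence[WeakExample],
    trust_fn: Callable[[WeakExample], float],
    threshold: float,
) -> List[WeakExample]:
    """Keep only the examples whose trust score is at least `threshold`."""
    return [ex for ex in examples if trust_fn(ex) >= threshold]


if __name__ == "__main__":
    weak_data = [
        WeakExample("Q1", "A", [0.95, 0.05]),
        WeakExample("Q2", "B", [0.45, 0.55]),
        WeakExample("Q3", "A", [0.80, 0.20]),
    ]
    kept = trust_filter(weak_data, max_prob_trust, threshold=0.8)
    # The strong student would then be trained only on `kept`.
    print([ex.prompt for ex in kept])
```

In an iterative chain, the student trained on the filtered set would take the teacher's place and produce the weak labels for the next round; that training step is omitted here because the abstract gives no detail about it.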