Trust Functions: Near-Lossless Weak-to-Strong Generalization by Learning When to Trust a Weak Teacher

Abstract

Weak-to-strong generalization studies how to improve a strong student model using supervision from a weaker teacher model when reliable labels are scarce. We view this primarily as a data selection problem, where the key challenge is to identify which weak labels are reliable enough to serve as a training signal. To address this, we introduce trust functions, which assign each weak label a scalar trust score, and we use these scores to filter the weak supervision. Across several domains, including world knowledge, quantitative reasoning, and strategy games, trust filtering yields students that match and sometimes surpass students trained with ground-truth supervision, achieving near-lossless weak-to-strong generalization. Moreover, trust functions enable an iterative weak-to-strong chain that compounds gains by training a student and reusing it as the teacher for the next stage. The advantage of trust functions can be attributed to several mechanisms.
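
As a rough illustration of the filtering step described in the abstract, the sketch below scores each weak label with a trust function and keeps only the labels whose score reaches a threshold. It is a minimal sketch under stated assumptions, not the paper's implementation: the names `filter_by_trust`, `toy_trust_fn`, and `toy_scores`, the threshold of 0.5, and the lookup-table trust function are all hypothetical stand-ins, since the abstract does not say how trust scores are computed or how they are turned into a filter.

```python
from typing import Callable, List, Tuple

# (input, weak_label) pair produced by the weak teacher.
WeakExample = Tuple[str, str]


def filter_by_trust(
    weak_data: List[WeakExample],
    trust_fn: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[WeakExample]:
    """Keep only weak labels whose scalar trust score reaches the threshold."""
    return [(x, y) for (x, y) in weak_data if trust_fn(x, y) >= threshold]


# Toy data (hypothetical): weak labels with hand-assigned trust scores.
toy_scores = {
    ("2 + 2", "4"): 0.9,
    ("capital of France", "Lyon"): 0.2,
    ("3 * 5", "15"): 0.7,
}
weak_data = list(toy_scores.keys())


def toy_trust_fn(x: str, y: str) -> float:
    # Stand-in for a learned trust function; here it is only a lookup table.
    return toy_scores[(x, y)]


# The filtered set is what the strong student would be trained on.
trusted = filter_by_trust(weak_data, toy_trust_fn, threshold=0.5)
print(trusted)
```

In this toy run the low-trust label ("capital of France", "Lyon") is dropped and the other two pairs are kept. In an iterative chain, the student trained on `trusted` would then act as the teacher that produces the next round of weak labels.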