Trust Functions: Near-Lossless Weak-to-Strong Generalization by Learning When to Trust a Weak Teacher

Abstract

Weak-to-strong generalization studies how to improve a strong student model using supervision from a weaker teacher when reliable labels are scarce. We frame this primarily as a data selection problem, in which the key challenge is identifying which weak labels are reliable enough to serve as a training signal. To address this, we introduce trust functions, which assign each weak label a scalar trust score and use these scores to filter weak supervision. Across several domains, including world knowledge, quantitative reasoning, and strategy games, trust filtering yields students that match, and sometimes surpass, students trained with ground-truth supervision, achieving near-lossless weak-to-strong generalization. Moreover, trust functions enable an iterative weak-to-strong chain in which each trained student is reused as the next teacher, compounding the gains across rounds. We attribute the advantage of trust functions to several mechanisms.
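
As an illustration of the filtering step described in the abstract, the following is a minimal sketch, not the authors' implementation. The abstract does not say how trust scores are computed, so the sketch assumes a stand-in trust function that uses the teacher's probability for its own label. The names WeakExample, confidence_trust, and trust_filter, the threshold value, and the toy data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class WeakExample:
    """One weakly labeled example (hypothetical container, not from the paper)."""
    text: str
    weak_label: int                  # label index chosen by the weak teacher
    teacher_probs: Sequence[float]   # teacher's class probabilities


def confidence_trust(example: WeakExample) -> float:
    """Stand-in trust function: the teacher's probability for its own label.

    Assumption: the source does not specify how trust is computed; any
    function mapping a weak label to a scalar score could be swapped in.
    """
    return float(example.teacher_probs[example.weak_label])


def trust_filter(
    examples: Sequence[WeakExample],
    trust_fn: Callable[[WeakExample], float],
    threshold: float,
) -> List[WeakExample]:
    """Keep only the examples whose scalar trust score reaches the threshold."""
    return [ex for ex in examples if trust_fn(ex) >= threshold]


# Toy weak supervision (made-up data, for illustration only).
weak_data = [
    WeakExample("2 + 2 = 4", 1, (0.05, 0.95)),
    WeakExample("The capital of France is Lyon", 1, (0.45, 0.55)),
    WeakExample("7 is an even number", 0, (0.90, 0.10)),
]

# Only the trusted subset would be passed on to student training.
trusted = trust_filter(weak_data, confidence_trust, threshold=0.8)
print([ex.text for ex in trusted])
```

In the iterative chain, the student trained on the trusted subset would produce the weak labels for the next round, with the same filter applied before each training step.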