Trust Functions: Near-Lossless Weak-to-Strong Generalization by Learning When to Trust a Weak Teacher

Abstract

Weak-to-strong generalization studies how to improve a strong student using supervision from a weaker teacher when reliable labels are scarce. We view this primarily as a data selection problem, where the key challenge is to identify which weak labels are reliable enough to serve as a training signal. To address this, we introduce trust functions, which assign each weak label a scalar trust score, and we use these scores to filter the weak supervision. Across several domains, including world knowledge, quantitative reasoning, and strategy games, trust filtering yields students that match, and sometimes surpass, students trained with ground-truth supervision, achieving near-lossless weak-to-strong generalization. Moreover, trust functions enable an iterative weak-to-strong chain in which each trained student is reused as the next teacher, compounding the gains. The advantage of trust functions can be attributed to several mechanisms.
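
The filtering step described above can be illustrated with a minimal sketch. The abstract does not specify how trust scores are computed, so everything here is an assumption made for illustration: the `WeakExample` container, the `confidence_trust` function (which simply uses the probability the teacher assigned to its own label), and the threshold value are all hypothetical and are not the paper's method.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class WeakExample:
    """One weakly labeled example (hypothetical container for illustration)."""
    text: str
    weak_label: int                 # label predicted by the weak teacher
    teacher_probs: Sequence[float]  # teacher's class probabilities


def confidence_trust(example: WeakExample) -> float:
    """Assumed stand-in trust function: the probability the teacher
    assigned to its own label. The paper's trust functions may differ."""
    return float(example.teacher_probs[example.weak_label])


def filter_by_trust(
    examples: List[WeakExample],
    trust_fn: Callable[[WeakExample], float],
    threshold: float,
) -> List[WeakExample]:
    """Keep only examples whose scalar trust score reaches the threshold."""
    return [ex for ex in examples if trust_fn(ex) >= threshold]


examples = [
    WeakExample("q1", 1, [0.05, 0.95]),
    WeakExample("q2", 0, [0.55, 0.45]),
    WeakExample("q3", 1, [0.20, 0.80]),
]

# Only the filtered subset would be used to train the strong student.
kept = filter_by_trust(examples, confidence_trust, threshold=0.75)
kept_texts = [ex.text for ex in kept]
print(kept_texts)  # ['q1', 'q3']
```

Under this sketch, the iterative chain would amount to repeating the same filter with each trained student's predictions taking the place of `teacher_probs`.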