Reasoning to Learn from Latent Thoughts

Abstract

Compute scaling for language model (LM) pretraining has outpaced the growth of human-written text, leading to concerns that data will become the bottleneck to LM scaling. To continue scaling pretraining in this data-constrained regime, we propose that explicitly modeling and inferring the latent thoughts that underlie the text generation process can significantly improve pretraining data efficiency. Intuitively, our approach views web text as the compressed final outcome of a verbose human thought process, and holds that the latent thoughts contain contextual knowledge and reasoning steps that are critical to data-efficient learning. We empirically demonstrate the effectiveness of our approach through data-constrained continued pretraining for math. We first show that synthetic data approaches to inferring latent thoughts significantly improve data efficiency, outperforming training on the same amount of raw data (5.7% → 25.4% on MATH). Furthermore, we demonstrate latent thought inference without a strong teacher, where an LM bootstraps its own performance by using an EM algorithm to iteratively improve the capability of the trained LM and the quality of thought-augmented pretraining data. We show that a 1B LM can bootstrap its performance across at least three iterations and significantly outperform baselines trained on raw data, with increasing gains from additional inference compute when performing the E-step. The gains from inference scaling and EM iterations suggest new opportunities for scaling data-constrained pretraining.
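The EM procedure mentioned in the abstract can be read through a generic latent-variable sketch. This is textbook importance-weighted EM, not necessarily the authors' exact formulation. All symbols are assumptions introduced here for illustration: $x$ is an observed text chunk, $z$ is its latent thought, $\theta_t$ denotes the LM parameters at EM iteration $t$, $q_{\theta_t}$ is a proposal over thoughts produced by the current LM, and $K$ is the number of candidate thoughts sampled per chunk (a stand-in for E-step inference compute).

```latex
% Assumed notation: x = observed text, z = latent thought, theta_t = LM at iteration t.

% Marginal likelihood of the observed text under a latent-thought model
\log p_\theta(x) = \log \sum_{z} p_\theta(z)\, p_\theta(x \mid z)

% E-step (approximate): sample K candidate thoughts from the current model
% and weight them by how well they explain x
z^{(1)}, \dots, z^{(K)} \sim q_{\theta_t}(z \mid x), \qquad
\bar{w}^{(k)} = \frac{w^{(k)}}{\sum_{j=1}^{K} w^{(j)}}, \quad
w^{(k)} = \frac{p_{\theta_t}(z^{(k)}, x)}{q_{\theta_t}(z^{(k)} \mid x)}

% M-step: train the LM on the thought-augmented data
\theta_{t+1} = \arg\max_{\theta} \sum_{x} \sum_{k=1}^{K} \bar{w}^{(k)} \log p_\theta\big(z^{(k)}, x\big)
```

Under this reading, a larger $K$ gives a better approximation of the posterior over thoughts. That would be consistent with the reported gains from additional E-step inference compute, though the abstract does not specify the sampling or weighting scheme.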
