Efficient Continual Pre-training by Mitigating the Stability Gap

Abstract

Continual pre-training has increasingly become the predominant approach for adapting Large Language Models (LLMs) to new domains. This process updates a pre-trained LLM with a corpus from a new domain, which shifts the training distribution. To study how LLMs behave during this shift, we measure model performance throughout the continual pre-training process. We observe a temporary performance drop at the beginning, followed by a recovery phase, a phenomenon known as the "stability gap" that was previously noted in vision models classifying new classes. To address this issue and improve LLM performance within a fixed compute budget, we propose three strategies: (1) continually pre-training the LLM on a properly sized subset for multiple epochs, which yields faster performance recovery than pre-training the LLM on a large corpus for a single epoch; (2) pre-training the LLM only on a high-quality sub-corpus, which rapidly boosts domain performance; and (3) using a data mixture similar to the pre-training data to reduce the distribution gap. We conduct various experiments on Llama-family models to validate the effectiveness of our strategies in both medical continual pre-training and instruction tuning. For example, our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget, and they enhance the average general task performance without causing forgetting. Furthermore, we apply our strategies to the Llama-3-8B model. The resulting model, Llama-3-Physician, achieves the best medical performance among current open-source models and performs comparably to or even better than GPT-4 on several medical benchmarks. We release our models at https://huggingface.co/YiDuo1999/Llama-3-Physician-8B-Instruct.
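To make the three data strategies concrete, the following is a minimal Python sketch of how a training schedule could be assembled under a fixed token budget. It is an illustration under stated assumptions, not the authors' implementation: the `Doc` record, its `quality` score, the helper names `take_until` and `build_schedule`, and the default values for `epochs` and `general_fraction` are all hypothetical, and the abstract does not say how quality is scored or what mixture ratio is used.

```python
import random
from dataclasses import dataclass


@dataclass
class Doc:
    text: str
    n_tokens: int
    quality: float  # hypothetical quality score; higher is better


def take_until(docs, budget):
    """Greedily keep docs, in order, that still fit in the token budget."""
    picked, used = [], 0
    for doc in docs:
        if used + doc.n_tokens <= budget:
            picked.append(doc)
            used += doc.n_tokens
    return picked


def build_schedule(domain_docs, general_docs, token_budget,
                   epochs=4, general_fraction=0.3, seed=0):
    """Return one shuffled list of Docs per epoch.

    All defaults are illustrative assumptions, not values from the paper.
    """
    if epochs < 1:
        raise ValueError("epochs must be at least 1")
    if not 0.0 <= general_fraction < 1.0:
        raise ValueError("general_fraction must be in [0, 1)")

    rng = random.Random(seed)
    per_epoch = token_budget // epochs
    domain_budget = round(per_epoch * (1.0 - general_fraction))
    general_budget = per_epoch - domain_budget

    # Strategy 2: prefer the highest-quality part of the domain corpus.
    ranked = sorted(domain_docs, key=lambda d: d.quality, reverse=True)
    domain_subset = take_until(ranked, domain_budget)

    # Strategy 3: mix in general data resembling the original
    # pre-training distribution to narrow the distribution gap.
    general_pool = list(general_docs)
    rng.shuffle(general_pool)
    general_subset = take_until(general_pool, general_budget)

    # Strategy 1: reuse the same modest subset for several epochs
    # instead of making one pass over a much larger corpus.
    subset = domain_subset + general_subset
    schedule = []
    for _ in range(epochs):
        epoch_docs = list(subset)
        rng.shuffle(epoch_docs)
        schedule.append(epoch_docs)
    return schedule
```

With a budget of 4,000 tokens, four epochs, and uniform 100-token documents, each epoch in this sketch revisits the same 1,000-token subset in a different order, so the total tokens processed never exceed the budget.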
