LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement

Abstract

Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks. While many real-world applications still require fine-tuning to reach satisfactory levels of performance, many of them are in the low-data regime, making fine-tuning challenging. To address this, we propose LLM2LLM, a targeted and iterative data augmentation strategy that uses a teacher LLM to enhance a small seed dataset by generating additional data that can be used for fine-tuning on a specific task. LLM2LLM (1) fine-tunes a baseline student LLM on the initial seed data, (2) evaluates and extracts data points that the model gets wrong, and (3) uses a teacher LLM to generate synthetic data based on these incorrect data points, which are then added back into the training data. This approach amplifies the signal from data points that the LLM predicts incorrectly during training and reintegrates them into the dataset, so that the LLM focuses on more challenging examples. Our results show that LLM2LLM significantly enhances the performance of LLMs in the low-data regime, outperforming both traditional fine-tuning and other data augmentation baselines. LLM2LLM reduces the dependence on labor-intensive data curation and paves the way for more scalable and performant LLM solutions, allowing us to tackle data-constrained domains and tasks. We achieve improvements of up to 24.2% on the GSM8K dataset, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC and 39.8% on SST-2 over regular fine-tuning in the low-data regime using a LLaMA2-7B student model.
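The three steps in the abstract can be sketched as a short loop. This is a minimal illustration, not the authors' implementation. The names `llm2llm_loop`, `toy_fine_tune`, and `toy_teacher` are hypothetical, and the student and teacher are toy stand-ins rather than real LLMs. The sketch also assumes that the student is retrained on the full training set each round and that errors are collected only on the seed examples; the abstract does not specify either detail.

```python
from collections import Counter
from typing import Callable, List, Tuple

Example = Tuple[str, str]          # (input text, expected answer)
Student = Callable[[str], str]     # maps an input to a predicted answer


def llm2llm_loop(
    seed: List[Example],
    fine_tune: Callable[[List[Example]], Student],
    teacher_augment: Callable[[Example], List[Example]],
    n_iters: int = 3,
) -> List[Example]:
    """Grow a seed dataset by augmenting only the examples the student gets wrong."""
    train = list(seed)
    for _ in range(n_iters):
        # Step 1: fine-tune the student on the current training data.
        student = fine_tune(train)
        # Step 2: collect the examples the student predicts incorrectly
        # (assumption: evaluated on seed examples only).
        wrong = [ex for ex in seed if student(ex[0]) != ex[1]]
        if not wrong:
            break
        # Step 3: the teacher writes new examples modeled on each failure,
        # and they are added back into the training data.
        for ex in wrong:
            train.extend(teacher_augment(ex))
    return train


def toy_fine_tune(train: List[Example]) -> Student:
    """Toy student: answers correctly only for topics seen at least twice."""
    topic_counts = Counter(x.split()[0] for x, _ in train)
    answers = dict(train)

    def student(x: str) -> str:
        if x in answers and topic_counts[x.split()[0]] >= 2:
            return answers[x]
        return "?"

    return student


def toy_teacher(ex: Example) -> List[Example]:
    """Toy teacher: returns one rephrased variant with the same answer."""
    x, y = ex
    return [(x + " (rephrased)", y)]


seed_data = [("add 2 3", "5"), ("add 1 1", "2"), ("mul 2 3", "6")]
augmented = llm2llm_loop(seed_data, toy_fine_tune, toy_teacher)
print(len(augmented), augmented[-1])
```

In this toy run only the `mul` example is missed in the first round, so it is the only one that receives an augmented variant. That is the targeted behavior the abstract describes: new data goes where the student is failing rather than being spread uniformly over the dataset.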
