Transfer Learning for Text Diffusion Models

Abstract

In this report, we explore the potential for text diffusion to replace autoregressive (AR) decoding for the training and deployment of large language models (LLMs). We are particularly interested in whether pretrained AR models can be transformed into text diffusion models through a lightweight adaptation procedure we call "AR2Diff". We begin by establishing a strong baseline setup for training text diffusion models. Comparing across multiple architectures and pretraining objectives, we find that training a decoder-only model with a prefix LM objective is best or near-best across several tasks. Building on this finding, we test various transfer learning setups for text diffusion models. On machine translation, we find that text diffusion underperforms the standard AR approach. However, on code synthesis and extractive QA, we find that diffusion models trained from scratch outperform AR models in many cases. We also observe quality gains from AR2Diff, which adapts AR models to use diffusion decoding. These results are promising given that text diffusion is relatively underexplored and can be significantly faster than AR decoding for long text generation.
