Training-Free Long-Context Scaling of Large Language Models

Abstract

The ability of Large Language Models (LLMs) to process and generate coherent text weakens markedly when the number of input tokens exceeds their pretraining length. Because finetuning large-scale models on longer sequences is expensive, we propose Dual Chunk Attention (DCA), which enables Llama2 70B to support context windows of more than 100k tokens without continual training. By decomposing the attention computation for long sequences into chunk-based modules, DCA effectively captures the relative positional information of tokens within the same chunk (Intra-Chunk) and across distinct chunks (Inter-Chunk), and it integrates seamlessly with Flash Attention. In addition to its strong extrapolation capability, DCA achieves performance on practical long-context tasks that is comparable to or even better than that of finetuned models. Compared with proprietary models, our training-free 70B model attains 94% of the performance of gpt-3.5-16k, indicating that it is a viable open-source alternative. All code and data used in this work are released at https://github.com/HKUNLP/ChunkLlama.
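The abstract describes the chunk decomposition only at a high level. As a rough illustration of why chunking can keep relative positions inside the pretraining range, the sketch below builds a causal relative-position matrix in plain Python. The function name `chunked_relative_positions`, the parameters `chunk_size` and `max_pos`, and in particular the choice to give inter-chunk queries the fixed index `max_pos - 1` are assumptions made for this example; they are not taken from the abstract and should not be read as the released implementation, which is available in the repository linked above.

```python
def chunked_relative_positions(seq_len, chunk_size, max_pos):
    """Illustrative sketch: relative positions for causal attention with chunking.

    rel[i][j] is the relative position used when query i attends to key j,
    or None when j > i (masked by causality).

    Assumptions (hypothetical, for illustration only):
      - keys are indexed by their offset inside their chunk;
      - a query in the same chunk as the key uses its own in-chunk offset;
      - a query in a later chunk uses the fixed index max_pos - 1.
    """
    if not 0 < chunk_size <= max_pos:
        raise ValueError("chunk_size must be between 1 and max_pos")

    rel = [[None] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(i + 1):
            key_pos = j % chunk_size
            if i // chunk_size == j // chunk_size:
                # Intra-chunk: both offsets are local to the shared chunk.
                query_pos = i % chunk_size
            else:
                # Inter-chunk: fixed query index (assumption of this sketch).
                query_pos = max_pos - 1
            rel[i][j] = query_pos - key_pos
    return rel


# Toy setting: 8 tokens, chunks of 4, model "pretrained" on positions 0..5.
rel = chunked_relative_positions(seq_len=8, chunk_size=4, max_pos=6)
largest = max(v for row in rel for v in row if v is not None)
print(largest)  # 5, whereas plain i - j would reach 7 for this length
```

In this toy setting every relative position stays below `max_pos` no matter how long the sequence grows, which is the property a training-free method needs. The sketch ignores everything else the abstract mentions, such as the Flash Attention integration.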
