Scaling TransNormer to 175 Billion Parameters

Abstract

We present TransNormerLLM, the first linear attention-based Large Language Model (LLM) that outperforms conventional softmax attention-based models in both accuracy and efficiency. TransNormerLLM evolves from the earlier linear attention architecture TransNormer through modifications to positional embedding, linear attention acceleration, the gating mechanism, tensor normalization, and inference acceleration and stabilization. Specifically, we use LRPE together with an exponential decay to avoid attention dilution while still allowing the model to retain global interactions between tokens. We also propose Lightning Attention, a technique that makes linear attention more than twice as fast at runtime and reduces its memory usage by a factor of four. To further improve TransNormer, we use a gating mechanism to smooth training and a new tensor normalization scheme that speeds up the model by over 20%. In addition, we develop a robust inference algorithm that ensures numerical stability and a consistent inference speed regardless of sequence length, so the model is efficient during both training and inference. The model is designed for scalability: it can be deployed on large-scale clusters and extended to larger model sizes while maintaining strong performance. We validate the design through a series of comprehensive experiments on our self-collected corpus, which exceeds 6TB in size and contains over 2 trillion tokens. To ensure data quality and relevance, we apply a new self-cleaning strategy to filter the collected data. Our pre-trained models will be released to foster community progress in efficient LLMs.
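To make the claim about consistent inference speed more concrete, the following is a minimal sketch of generic causal linear attention with a scalar exponential decay. It is not the authors' implementation: it omits LRPE, gating, normalization, and Lightning Attention. The function names and the single `decay` parameter are assumptions introduced for illustration. The sketch shows that the quadratic form, which sums over all earlier tokens, gives the same output as a recurrent form that keeps only a fixed-size state, so the work per generated token does not grow with sequence length.

```python
import random


def decayed_linear_attention_parallel(q, k, v, decay):
    """Quadratic reference: o_t = sum_{s<=t} decay**(t-s) * (q_t . k_s) * v_s."""
    n, d_v = len(q), len(v[0])
    out = []
    for t in range(n):
        o_t = [0.0] * d_v
        for s in range(t + 1):
            # Unnormalized score, damped by the distance between positions.
            score = decay ** (t - s) * sum(a * b for a, b in zip(q[t], k[s]))
            for j in range(d_v):
                o_t[j] += score * v[s][j]
        out.append(o_t)
    return out


def decayed_linear_attention_recurrent(q, k, v, decay):
    """Recurrent form: one d_k x d_v state, updated once per token."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q_t, k_t, v_t in zip(q, k, v):
        # state <- decay * state + outer(k_t, v_t)
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = decay * state[i][j] + k_t[i] * v_t[j]
        # o_t = q_t @ state; cost depends on d_k and d_v only, not on position.
        out.append(
            [sum(q_t[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return out


def random_matrix(rows, cols, rng):
    """Hypothetical helper: a rows x cols matrix with entries in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


if __name__ == "__main__":
    rng = random.Random(0)
    q = random_matrix(8, 4, rng)
    k = random_matrix(8, 4, rng)
    v = random_matrix(8, 4, rng)
    parallel = decayed_linear_attention_parallel(q, k, v, decay=0.9)
    recurrent = decayed_linear_attention_recurrent(q, k, v, decay=0.9)
    print("max difference:", max_abs_diff(parallel, recurrent))
```

The two functions agree up to floating-point rounding. With a decay below 1, the contribution of distant tokens shrinks geometrically but never reaches exactly zero, which is one simple way to picture how a decay can favor nearby tokens without cutting off global interactions.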
