Spectra: A Comprehensive Study of Ternary, Quantized, and FP16 Language Models

Abstract

Post-training quantization is the leading method for addressing memory-related bottlenecks in LLM inference, but it suffers from significant performance degradation below 4-bit precision. An alternative approach is to train compressed models directly at a low bitwidth (e.g., binary or ternary models). However, the performance, training dynamics, and scaling trends of such models are not yet well understood. To address this issue, we train and openly release the Spectra LLM suite, consisting of 54 language models ranging from 99M to 3.9B parameters, trained on 300B tokens. Spectra includes FloatLMs, post-training quantized QuantLMs (3, 4, 6, and 8 bits), and ternary LLMs (TriLMs), our improved architecture for ternary language modeling, which significantly outperforms previously proposed ternary models of a given size (in bits) and matches half-precision models at scale. For example, TriLM 3.9B is (bit-wise) smaller than the half-precision FloatLM 830M, but matches half-precision FloatLM 3.9B on commonsense reasoning and knowledge benchmarks. However, TriLM 3.9B is also as toxic and stereotyping as FloatLM 3.9B, a model six times larger in size. Additionally, TriLM 3.9B lags behind FloatLM in perplexity on validation splits and web-based corpora but performs better on less noisy datasets like Lambada and PennTreeBank. To enhance understanding of low-bitwidth models, we are releasing 500+ intermediate checkpoints of the Spectra suite at https://github.com/NolanoOrg/SpectraSuite.
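The bit-wise size comparison in the abstract can be sanity-checked with rough arithmetic. The sketch below is illustrative only: it assumes every parameter of a ternary model costs an idealized log2(3) bits (about 1.58) and every parameter of a half-precision model costs 16 bits. The function and variable names are hypothetical and do not come from the Spectra code base.

```python
import math

# Assumption: idealized storage cost per parameter, in bits.
BITS_FP16 = 16.0
BITS_TERNARY = math.log2(3)  # three states {-1, 0, +1}, about 1.58 bits


def size_gbit(n_params: float, bits_per_param: float) -> float:
    """Return the model size in gigabits under the idealized cost model."""
    return n_params * bits_per_param / 1e9


# Parameter counts taken from the abstract (3.9B and 830M).
trilm_3_9b = size_gbit(3.9e9, BITS_TERNARY)
floatlm_830m = size_gbit(830e6, BITS_FP16)
floatlm_3_9b = size_gbit(3.9e9, BITS_FP16)

print(f"TriLM 3.9B   (idealized): {trilm_3_9b:.2f} Gbit")
print(f"FloatLM 830M (FP16):      {floatlm_830m:.2f} Gbit")
print(f"FloatLM 3.9B (FP16):      {floatlm_3_9b:.2f} Gbit")
print(f"FloatLM 3.9B / TriLM 3.9B: {floatlm_3_9b / trilm_3_9b:.1f}x")
```

Under these assumptions the ternary 3.9B model is smaller in bits than the half-precision 830M model, which agrees with the abstract. The idealized ratio against FloatLM 3.9B comes out larger than the factor of six stated in the abstract. One plausible reading, which is an assumption here and not a claim made in the abstract, is that a practical ternary model keeps some parameters and scaling values at higher precision.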
