70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float

Abstract

Large Language Models (LLMs) have grown rapidly in size, creating significant challenges for efficient deployment on resource-constrained hardware. In this paper, we introduce Dynamic-Length Float (DFloat11), a lossless compression framework that reduces LLM size by 30% while preserving outputs that are bit-for-bit identical to the original model. DFloat11 is motivated by the low entropy in the BFloat16 weight representation of LLMs, which reveals significant inefficiency in the existing storage format. By applying entropy coding, DFloat11 assigns dynamic-length encodings to weights based on frequency, achieving near information-optimal compression without any loss of precision. To facilitate efficient inference with dynamic-length encodings, we develop a custom GPU kernel for fast online decompression. Our design incorporates the following: (i) decomposition of memory-intensive lookup tables (LUTs) into compact LUTs that fit in GPU SRAM, (ii) a two-phase kernel for coordinating thread read/write positions using lightweight auxiliary variables, and (iii) transformer-block-level decompression to minimize latency. Experiments on recent models, including Llama-3.1, Qwen-2.5, and Gemma-3, validate our hypothesis that DFloat11 achieves around 30% model size reduction while preserving bit-for-bit exact outputs. Compared to a potential alternative of offloading parts of an uncompressed model to the CPU to meet memory constraints, DFloat11 achieves 1.9-38.8x higher throughput in token generation. With a fixed GPU memory budget, DFloat11 enables 5.3-13.17x longer context lengths than uncompressed models. Notably, our method enables lossless inference of Llama-3.1-405B, an 810GB model, on a single node equipped with 8x80GB GPUs. Our code and models are available at https://github.com/LeanModels/DFloat11.
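The idea of assigning shorter codes to more frequent values can be illustrated with a small sketch. It is not the DFloat11 implementation: it assumes that the redundancy sits in the 8 exponent bits of BFloat16 (the abstract only says the representation has low entropy), uses Huffman coding as one common entropy coder (the abstract only says "entropy coding"), truncates float32 to BFloat16 instead of rounding, and draws synthetic Gaussian weights instead of loading a real model. The function names `bf16_exponent`, `huffman_code_lengths`, and `summarize` are hypothetical.

```python
import heapq
import math
import random
import struct
from collections import Counter


def bf16_exponent(x: float) -> int:
    """Return the 8 exponent bits of x after truncating float32 to BFloat16."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    bits16 = bits32 >> 16        # BFloat16 keeps the top 16 bits of float32
    return (bits16 >> 7) & 0xFF  # layout: 1 sign, 8 exponent, 7 mantissa bits


def huffman_code_lengths(freqs):
    """Map each symbol to its Huffman code length in bits."""
    if len(freqs) == 1:
        return {sym: 1 for sym in freqs}
    # Heap entries: (weight, tie_breaker, {symbol: depth in the subtree}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in d1.items()}
        merged.update({s: d + 1 for s, d in d2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def summarize(weights):
    """Return (exponent entropy, mean Huffman code length, bits per weight)."""
    counts = Counter(bf16_exponent(w) for w in weights)
    total = sum(counts.values())
    entropy = -sum(c / total * math.log2(c / total) for c in counts.values())
    lengths = huffman_code_lengths(counts)
    avg_len = sum(counts[s] * lengths[s] for s in counts) / total
    # Assumption: sign (1 bit) and mantissa (7 bits) are stored uncompressed.
    bits_per_weight = 1 + 7 + avg_len
    return entropy, avg_len, bits_per_weight


random.seed(0)
# Synthetic stand-in for model weights; real checkpoints will differ.
weights = [random.gauss(0.0, 0.02) for _ in range(50_000)]
entropy, avg_len, bits_per_weight = summarize(weights)
print(f"exponent entropy:       {entropy:.2f} bits (out of 8)")
print(f"mean Huffman code size: {avg_len:.2f} bits")
print(f"bits per weight:        {bits_per_weight:.2f} (BFloat16 uses 16)")
```

The printed numbers describe only this synthetic distribution and are not results from the paper. The sketch also covers only the size accounting; decoding variable-length codes in parallel on a GPU is the part the paper's custom kernel addresses.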
