PockEngine: Sparse and Efficient Fine-tuning in a Pocket
October 26, 2023
Authors: Ligeng Zhu, Lanxiang Hu, Ji Lin, Wei-Chen Wang, Wei-Ming Chen, Chuang Gan, Song Han
cs.AI
Abstract
On-device learning and efficient fine-tuning enable continuous and
privacy-preserving customization (e.g., locally fine-tuning large language
models on personalized data). However, existing training frameworks are
designed for cloud servers with powerful accelerators (e.g., GPUs, TPUs) and
lack the optimizations for learning on the edge, which faces challenges of
resource limitations and edge hardware diversity. We introduce PockEngine: a
tiny, sparse and efficient engine to enable fine-tuning on various edge
devices. PockEngine supports sparse backpropagation: it prunes the backward
graph and sparsely updates the model, yielding measured memory savings and
latency reduction while maintaining model quality. Secondly, PockEngine is
compilation first: the entire training graph (including forward, backward and
optimization steps) is derived at compile-time, which reduces the runtime
overhead and brings opportunities for graph transformations. PockEngine also
integrates a rich set of training graph optimizations, including operator
reordering and backend switching, which further reduce training cost.
PockEngine supports diverse applications, frontends and hardware
backends: it flexibly compiles and tunes models defined in
PyTorch/TensorFlow/Jax and deploys binaries to mobile CPU/GPU/DSPs. We
evaluated PockEngine on both vision models and large language models.
PockEngine achieves up to 15 times speedup over off-the-shelf TensorFlow
(Raspberry Pi) and 5.6 times memory savings for back-propagation (Jetson AGX Orin).
Remarkably, PockEngine enables fine-tuning LLaMav2-7B on NVIDIA Jetson AGX Orin
at 550 tokens/s, 7.9 times faster than PyTorch.
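To illustrate the sparse-backpropagation idea described above, here is a minimal NumPy sketch (an illustration of the concept, not PockEngine's actual engine): a two-layer linear network in which the first layer is frozen, so the backward graph is pruned and only the second layer's weights are updated.

```python
import numpy as np

# Minimal sketch of sparse backpropagation: a two-layer linear net where
# the first layer is frozen. The backward graph is pruned -- gradients are
# never propagated into the frozen layer, and only W2 is updated.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))   # frozen layer: no backward pass, no update
W2 = rng.normal(size=(8, 2))   # trainable layer
x = rng.normal(size=(16, 4))
y = rng.normal(size=(16, 2))

def sparse_train_step(W1, W2, x, y, lr=0.01):
    h = x @ W1                 # forward through frozen layer
    pred = h @ W2              # forward through trainable layer
    err = pred - y             # dL/dpred for 0.5 * mean squared error
    # Pruned backward: compute only dL/dW2; dL/dh and dL/dW1 are skipped,
    # saving both the upstream gradient computation and the memory that
    # would hold it.
    dW2 = h.T @ err / len(x)
    return W2 - lr * dW2       # sparse update: W1 is left untouched

W2_new = sparse_train_step(W1, W2, x, y)
```

Because the gradient chain stops at the last trainable layer, both the compute and the activation memory for everything upstream of it are eliminated, which is the source of the memory and latency savings the abstract reports.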