PockEngine: Sparse and Efficient Fine-tuning in a Pocket
October 26, 2023
Authors: Ligeng Zhu, Lanxiang Hu, Ji Lin, Wei-Chen Wang, Wei-Ming Chen, Chuang Gan, Song Han
cs.AI
Abstract
On-device learning and efficient fine-tuning enable continuous and
privacy-preserving customization (e.g., locally fine-tuning large language
models on personalized data). However, existing training frameworks are
designed for cloud servers with powerful accelerators (e.g., GPUs, TPUs) and
lack the optimizations for learning on the edge, which faces challenges of
resource limitations and edge hardware diversity. We introduce PockEngine: a
tiny, sparse and efficient engine to enable fine-tuning on various edge
devices. PockEngine supports sparse backpropagation: it prunes the backward
graph and sparsely updates the model, yielding measured memory savings and
latency reduction while maintaining model quality. Secondly, PockEngine is
compilation first: the entire training graph (including forward, backward and
optimization steps) is derived at compile-time, which reduces the runtime
overhead and brings opportunities for graph transformations. PockEngine also
integrates a rich set of training graph optimizations, including operator
reordering and backend switching, which further reduce training cost.
PockEngine supports diverse applications, frontends and hardware
backends: it flexibly compiles and tunes models defined in
PyTorch/TensorFlow/Jax and deploys binaries to mobile CPU/GPU/DSPs. We
evaluated PockEngine on both vision models and large language models.
PockEngine achieves up to 15 times speedup over off-the-shelf TensorFlow
(Raspberry Pi) and 5.6 times memory savings for back-propagation (Jetson AGX Orin).
Remarkably, PockEngine enables fine-tuning LLaMav2-7B on NVIDIA Jetson AGX Orin
at 550 tokens/s, 7.9 times faster than PyTorch.
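To illustrate the sparse-backpropagation idea described above, here is a minimal NumPy sketch (an illustration of the concept, not PockEngine's actual engine): a two-layer linear network in which the first layer is frozen, so the backward graph is pruned and only the second layer's weights are updated.

```python
import numpy as np

# Minimal sketch of sparse backpropagation: a two-layer linear net where
# the first layer is frozen. The backward graph is pruned -- gradients are
# never propagated into the frozen layer, and only W2 is updated.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))   # frozen layer: no backward pass, no update
W2 = rng.normal(size=(8, 2))   # trainable layer
x = rng.normal(size=(16, 4))
y = rng.normal(size=(16, 2))

def sparse_train_step(W1, W2, x, y, lr=0.01):
    h = x @ W1                 # forward through frozen layer
    pred = h @ W2              # forward through trainable layer
    err = pred - y             # dL/dpred for 0.5 * mean squared error
    # Pruned backward: compute only dL/dW2; dL/dh and dL/dW1 are skipped,
    # saving both the upstream gradient computation and the memory that
    # would hold it.
    dW2 = h.T @ err / len(x)
    return W2 - lr * dW2       # sparse update: W1 is left untouched

W2_new = sparse_train_step(W1, W2, x, y)
```

Because the gradient chain stops at the last trainable layer, both the compute and the activation memory for everything upstream of it are eliminated, which is the source of the memory and latency savings the abstract reports.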