PockEngine: Sparse and Efficient Fine-tuning in a Pocket
October 26, 2023
Authors: Ligeng Zhu, Lanxiang Hu, Ji Lin, Wei-Chen Wang, Wei-Ming Chen, Chuang Gan, Song Han
cs.AI
Abstract
On-device learning and efficient fine-tuning enable continuous and
privacy-preserving customization (e.g., locally fine-tuning large language
models on personalized data). However, existing training frameworks are
designed for cloud servers with powerful accelerators (e.g., GPUs, TPUs) and
lack optimizations for learning on the edge, which faces the challenges of
resource limitations and edge-hardware diversity. We introduce PockEngine: a
tiny, sparse and efficient engine to enable fine-tuning on various edge
devices. PockEngine supports sparse backpropagation: it prunes the backward
graph and sparsely updates the model with measured memory saving and latency
reduction while maintaining model quality. Second, PockEngine is
compilation-first: the entire training graph (including the forward, backward, and
optimization steps) is derived at compile time, which reduces runtime
overhead and opens up opportunities for graph transformations. PockEngine also
integrates a rich set of training-graph optimizations, such as operator
reordering and backend switching, which further reduce training cost.
PockEngine supports diverse applications, frontends, and hardware
backends: it flexibly compiles and tunes models defined in
PyTorch/TensorFlow/Jax and deploys binaries to mobile CPU/GPU/DSPs. We
evaluated PockEngine on both vision models and large language models.
PockEngine achieves up to a 15× speedup over off-the-shelf TensorFlow
(Raspberry Pi) and a 5.6× memory saving for backpropagation (Jetson AGX Orin).
Remarkably, PockEngine enables fine-tuning LLaMav2-7B on the NVIDIA Jetson AGX
Orin at 550 tokens/s, 7.9× faster than PyTorch.
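
To make the sparse-backpropagation idea above concrete, here is a minimal PyTorch sketch (illustrative only, not PockEngine's API): gradients are computed only for a hand-picked subset of parameters, so autograd never records the frozen layers and the optimizer keeps state only for the trainable subset.

```python
import torch
import torch.nn as nn

# Toy model standing in for a network to be fine-tuned on-device.
model = nn.Sequential(
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 10),
)

# Freeze everything, then re-enable updates for a sparse subset of layers.
# The "last layer only" choice is illustrative; PockEngine selects which
# layers/tensors to update based on measured memory savings and latency.
for p in model.parameters():
    p.requires_grad_(False)
for p in model[-1].parameters():
    p.requires_grad_(True)

# Optimizer state is allocated only for the trainable subset.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-2)

x, y = torch.randn(8, 512), torch.randint(0, 10, (8,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()   # autograd skips the frozen layers: a pruned backward graph
optimizer.step()
optimizer.zero_grad()
```

In this toy setup nothing upstream of the last layer needs gradients, so autograd stores only that layer's input for the backward pass, which is where the memory saving comes from; PockEngine additionally derives the pruned backward graph ahead of time rather than at runtime.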
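The compilation-first design also has a rough analogy in an existing API: torch.func derives a gradient function from a loss function ahead of time, as a transformation of the computation rather than runtime taping. The sketch below uses only standard torch.func calls and makes no claim about PockEngine's internals.

```python
import torch
from torch.func import functional_call, grad

model = torch.nn.Linear(4, 2)
params = dict(model.named_parameters())

# A pure loss function of (params, data): its backward computation can be
# derived once, up front, instead of being re-taped every iteration.
def loss_fn(params, x, y):
    out = functional_call(model, params, (x,))
    return torch.nn.functional.cross_entropy(out, y)

grad_fn = grad(loss_fn)  # gradient function, derived ahead of time

x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))
grads = grad_fn(params, x, y)  # dict with the same structure as params
```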