PockEngine: Sparse and Efficient Fine-tuning in a Pocket

Abstract

On-device learning and efficient fine-tuning enable continuous and privacy-preserving customization (e.g., locally fine-tuning large language models on personalized data). However, existing training frameworks are designed for cloud servers with powerful accelerators (e.g., GPUs, TPUs) and lack optimizations for learning on the edge, which faces the challenges of limited resources and diverse edge hardware. We introduce PockEngine: a tiny, sparse, and efficient engine that enables fine-tuning on various edge devices. First, PockEngine supports sparse backpropagation: it prunes the backward graph and sparsely updates the model, yielding measured memory savings and latency reduction while maintaining model quality. Second, PockEngine is compilation-first: the entire training graph (including the forward, backward, and optimization steps) is derived at compile time, which reduces runtime overhead and creates opportunities for graph transformations. PockEngine also integrates a rich set of training graph optimizations, including operator reordering and backend switching, which further reduce training cost. PockEngine supports diverse applications, frontends, and hardware backends: it flexibly compiles and tunes models defined in PyTorch/TensorFlow/JAX and deploys binaries to mobile CPUs/GPUs/DSPs. We evaluated PockEngine on both vision models and large language models. PockEngine achieves up to a 15 times speedup over off-the-shelf TensorFlow (Raspberry Pi) and a 5.6 times memory saving in backpropagation (Jetson AGX Orin). Remarkably, PockEngine enables fine-tuning LLaMav2-7B on NVIDIA Jetson AGX Orin at 550 tokens/s, 7.9 times faster than PyTorch.
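To make the idea of pruning the backward graph ahead of time more concrete, here is a minimal sketch in plain Python. It is not PockEngine's API or implementation: the `Layer` class, the `trainable` flag, and the `plan_backward` function are hypothetical names introduced only for illustration, and the model is assumed to be a simple sequential chain. The sketch decides, before any training step runs, which backward operations are needed when only some layers are updated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    trainable: bool  # hypothetical flag: True if this layer's weights are updated


def plan_backward(layers):
    """Return the backward ops a sequential model needs, in execution order.

    For a chain y = f_n(... f_1(x)):
      * a weight gradient is needed only for layers that are trainable;
      * an input gradient is needed only while some earlier layer is trainable.
    Everything before the first trainable layer is pruned from the backward pass.
    """
    trainable_idx = [i for i, layer in enumerate(layers) if layer.trainable]
    if not trainable_idx:
        return []  # nothing to update, so no backward graph at all
    first = trainable_idx[0]

    ops = []
    # Walk from the output back to the earliest trainable layer.
    for i in range(len(layers) - 1, first - 1, -1):
        layer = layers[i]
        if layer.trainable:
            ops.append(("grad_weight", layer.name))
        if i > first:
            # The gradient must keep flowing to reach an earlier trainable layer.
            ops.append(("grad_input", layer.name))
    return ops


names = ["conv1", "conv2", "conv3", "fc"]

# Dense update: every layer is trainable.
dense_plan = plan_backward([Layer(n, True) for n in names])

# Sparse update: only the last two layers are trainable (an assumed choice).
sparse_plan = plan_backward(
    [Layer("conv1", False), Layer("conv2", False),
     Layer("conv3", True), Layer("fc", True)]
)

print(len(dense_plan), "backward ops for the dense update")
print(len(sparse_plan), "backward ops for the sparse update:", sparse_plan)
```

In this toy setting, the sparse plan skips every backward operation for `conv1` and `conv2`, and it also skips the input gradient of `conv3`, so activations saved only for those steps would no longer need to be kept. Because the plan is computed once from the model structure, it reflects the compile-time flavor described in the abstract rather than a per-step runtime decision.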
