Efficient Personalization of Quantized Diffusion Model without Backpropagation

Abstract

Diffusion models have shown remarkable performance in image synthesis, but they demand extensive computational and memory resources for training, fine-tuning, and inference. Although advanced quantization techniques have successfully minimized memory usage for inference, training and fine-tuning these quantized models still require large amounts of memory, possibly due to dequantization for accurate gradient computation and/or backpropagation for gradient-based algorithms. Memory-efficient fine-tuning is nevertheless particularly desirable for applications such as personalization, which often must run on edge devices like mobile phones with private data. In this work, we address this challenge by quantizing a diffusion model with personalization via Textual Inversion and by applying zeroth-order optimization to the personalization tokens without dequantization, so that no gradients or activations need to be stored for backpropagation, which consumes considerable memory. Because gradient estimation with zeroth-order optimization is quite noisy when only a single image or a few images are available for personalization, we propose to denoise the estimated gradient by projecting it onto a subspace constructed from the past history of the tokens, dubbed Subspace Gradient. In addition, we investigate the influence of the text embedding on image generation, which leads to our proposed timestep sampling scheme, dubbed Partial Uniform Timestep Sampling, which samples only from effective diffusion timesteps. Our method achieves image and text alignment scores comparable to prior methods for personalizing Stable Diffusion using only forward passes, while reducing training memory demand by up to 8.2×.
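
The three ingredients named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and the abstract does not give these details, so everything specific here is an assumption: the two-point random-direction estimator, the use of an SVD over centered past token vectors to build the subspace, the `rank` parameter, and the caller-supplied timestep bounds `t_lo` and `t_hi`. All function names are hypothetical.

```python
import numpy as np


def zo_gradient(loss_fn, theta, rng, mu=1e-3, n_samples=8):
    """Estimate the gradient of loss_fn at theta using forward evaluations only.

    Assumed estimator: average of two-point finite differences along
    random Gaussian directions. No backpropagation is involved.
    """
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        u = rng.standard_normal(theta.shape)
        # Central difference of the loss along direction u.
        delta = loss_fn(theta + mu * u) - loss_fn(theta - mu * u)
        grad += (delta / (2.0 * mu)) * u
    return grad / n_samples


def project_onto_history(grad, history, rank=2):
    """Project grad onto a subspace spanned by past token vectors.

    Assumption: the subspace is the span of the top-`rank` right singular
    vectors of the centered history matrix (a PCA-style choice).
    """
    H = np.stack(history)                  # shape: (n_steps, dim)
    H = H - H.mean(axis=0)                 # center the history
    _, _, vt = np.linalg.svd(H, full_matrices=False)
    basis = vt[:rank]                      # orthonormal rows, shape: (rank, dim)
    return basis.T @ (basis @ grad)        # orthogonal projection


def sample_partial_uniform(rng, t_lo, t_hi, size):
    """Draw timesteps uniformly from the partial range [t_lo, t_hi).

    The bounds are hypothetical and must be supplied by the caller.
    """
    return rng.integers(t_lo, t_hi, size=size)
```

In a training loop under these assumptions, each step would draw timesteps with `sample_partial_uniform`, evaluate the denoising loss through the quantized model inside `loss_fn`, estimate the gradient with `zo_gradient`, and apply `project_onto_history` before updating the personalization token.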
