RDTF: Resource-efficient Dual-mask Training Framework for Multi-frame Animated Sticker Generation

Abstract

Video generation technology has advanced considerably in recent years and has attracted wide attention from researchers. To apply it to downstream applications under resource-constrained conditions, researchers usually fine-tune pre-trained models with parameter-efficient tuning methods such as Adapter or LoRA. Although these methods can transfer knowledge from the source domain to the target domain, the small number of trainable parameters leads to poor fitting ability, and knowledge from the source domain may cause the inference process to deviate from the target domain. In this paper, we argue that under constrained resources, training a smaller video generation model from scratch using only million-level samples can outperform parameter-efficient tuning of larger models in downstream applications; the core lies in the effective utilization of data and in the curriculum strategy. Taking animated sticker generation (ASG) as a case study, we first construct a discrete frame generation network for stickers with low frame rates, ensuring that its parameters meet the requirements of model training under constrained resources. To provide data support for models trained from scratch, we propose a dual-mask based data utilization strategy that improves the availability and expands the diversity of limited data. To facilitate convergence in the dual-mask setting, we propose a difficulty-adaptive curriculum learning method that decomposes sample entropy into static and adaptive components so that samples are drawn from easy to difficult. Experiments demonstrate that our resource-efficient dual-mask training framework is quantitatively and qualitatively superior to parameter-efficient tuning methods such as I2V-Adapter and SimDA, verifying the feasibility of our method on downstream tasks under constrained resources. Code will be made available.
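
The abstract does not say how the static and adaptive components are computed or combined, so the following is only an illustrative sketch of easy-to-hard sampling, not the authors' method. It assumes each sample has a fixed (static) difficulty score and a model-dependent (adaptive) score, blends them with a hypothetical `weight`, and admits a growing fraction of the easiest samples as training progresses. All function names, parameters, and the linear schedule are assumptions made for illustration.

```python
import random


def combined_difficulty(static_score, adaptive_score, weight=0.5):
    """Blend a fixed difficulty score with a model-dependent one.

    `weight` is a hypothetical mixing coefficient, not taken from the paper.
    """
    return (1.0 - weight) * static_score + weight * adaptive_score


def curriculum_pool(static_scores, adaptive_scores, progress, min_fraction=0.2):
    """Return indices of the samples admitted at a given training progress.

    Samples are ranked from easy to hard by combined difficulty. The admitted
    fraction grows linearly from `min_fraction` to 1.0 as `progress` moves
    from 0 to 1 (an assumed schedule).
    """
    difficulty = [
        combined_difficulty(s, a) for s, a in zip(static_scores, adaptive_scores)
    ]
    ranked = sorted(range(len(difficulty)), key=lambda i: difficulty[i])
    clipped = max(0.0, min(1.0, progress))
    fraction = min_fraction + (1.0 - min_fraction) * clipped
    keep = max(1, round(fraction * len(ranked)))
    return ranked[:keep]


def sample_batch(pool, batch_size, rng=random):
    """Draw a batch (with replacement) from the currently admitted pool."""
    return [rng.choice(pool) for _ in range(batch_size)]


if __name__ == "__main__":
    static = [0.1, 0.9, 0.5, 0.3]    # toy fixed difficulty per sample
    adaptive = [0.2, 0.8, 0.4, 0.7]  # toy model-dependent difficulty
    for progress in (0.0, 0.5, 1.0):
        print(progress, curriculum_pool(static, adaptive, progress))
```

In a real training loop the adaptive scores would be refreshed periodically from the current model, so the ranking can change between epochs while the static scores stay fixed.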

RDTF: Resource-efficient Dual-mask Training Framework for Multi-frame Animated Sticker Generation
