

RDTF: Resource-efficient Dual-mask Training Framework for Multi-frame Animated Sticker Generation

March 22, 2025
作者: Zhiqiang Yuan, Ting Zhang, Ying Deng, Jiapei Zhang, Yeshuang Zhu, Zexi Jia, Jie Zhou, Jinchao Zhang
cs.AI

Abstract
Recently, video generation technology has made great progress, attracting widespread attention from researchers. To apply this technology to downstream tasks under resource-constrained conditions, researchers usually fine-tune pre-trained models with parameter-efficient tuning methods such as Adapter or LoRA. Although these methods can transfer knowledge from the source domain to the target domain, the small number of trainable parameters limits fitting ability, and the source-domain knowledge may cause inference to deviate from the target domain. In this paper, we argue that under constrained resources, training a smaller video generation model from scratch on only million-scale samples can outperform parameter-efficient tuning of larger models on downstream applications: the key lies in the effective use of data and a curriculum strategy. Taking animated sticker generation (ASG) as a case study, we first construct a discrete frame generation network for stickers with low frame rates, ensuring that its parameter count meets the requirements of model training under constrained resources. To provide data support for models trained from scratch, we propose a dual-mask based data utilization strategy, which improves the availability and expands the diversity of the limited data. To facilitate convergence under the dual-mask setting, we propose a difficulty-adaptive curriculum learning method that decomposes sample entropy into static and adaptive components, so that samples are presented from easy to difficult. Experiments demonstrate that our resource-efficient dual-mask training framework is quantitatively and qualitatively superior to parameter-efficient tuning methods such as I2V-Adapter and SimDA, verifying the feasibility of our approach for downstream tasks under constrained resources. Code will be made available.
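The abstract describes the curriculum only at a high level: each sample's difficulty combines a static component (fixed before training) with an adaptive component (updated from the model's current behavior), and training draws samples from easy to hard. The following is a minimal, hypothetical Python sketch of such a scheduler; the class name, the EMA update, and the linear easy-to-hard pool growth are all assumptions for illustration, not the paper's actual formulation:

```python
import random


class DifficultyAdaptiveCurriculum:
    """Toy easy-to-hard sampler: difficulty = static + adaptive component."""

    def __init__(self, static_scores, alpha=0.5):
        # static_scores: per-sample difficulty fixed before training
        # (e.g., motion magnitude of a sticker clip -- an assumed proxy)
        self.static = list(static_scores)
        self.adaptive = [0.0] * len(static_scores)  # updated from training loss
        self.alpha = alpha  # blend between static and adaptive parts

    def difficulty(self, i):
        return self.alpha * self.static[i] + (1 - self.alpha) * self.adaptive[i]

    def update(self, i, loss):
        # exponential moving average of the sample's recent training loss
        self.adaptive[i] = 0.9 * self.adaptive[i] + 0.1 * loss

    def batch(self, progress, size):
        """Sample from the easiest fraction of the pool; the fraction
        grows with training progress in [0, 1] (easy -> hard)."""
        order = sorted(range(len(self.static)), key=self.difficulty)
        pool = order[: max(size, int(len(order) * progress))]
        return random.sample(pool, min(size, len(pool)))
```

Early in training (`progress` near 0) only the lowest-difficulty samples are eligible; as `progress` approaches 1 the pool covers the whole dataset, while `update` lets samples the model currently struggles with drift toward the hard end of the ordering.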
