

RDTF: Resource-efficient Dual-mask Training Framework for Multi-frame Animated Sticker Generation

March 22, 2025
Authors: Zhiqiang Yuan, Ting Zhang, Ying Deng, Jiapei Zhang, Yeshuang Zhu, Zexi Jia, Jie Zhou, Jinchao Zhang
cs.AI

Abstract

Recently, video generation technology has made great progress, attracting widespread attention from researchers. To apply this technology to downstream tasks under resource-constrained conditions, researchers usually fine-tune pre-trained models with parameter-efficient tuning methods such as Adapter or LoRA. Although these methods can transfer knowledge from the source domain to the target domain, the small number of trainable parameters limits fitting ability, and the source-domain knowledge can cause inference to deviate from the target domain. In this paper, we argue that under constrained resources, training a smaller video generation model from scratch using only million-level samples can outperform parameter-efficient tuning of larger models on downstream applications: the core lies in the effective utilization of data and the design of the curriculum strategy. Taking animated sticker generation (ASG) as a case study, we first construct a discrete frame generation network for stickers with low frame rates, ensuring that its parameter count meets the requirements of model training under constrained resources. To provide data support for models trained from scratch, we propose a dual-mask based data utilization strategy, which improves the availability and expands the diversity of limited data. To facilitate convergence under the dual-mask setting, we propose a difficulty-adaptive curriculum learning method, which decomposes the sample entropy into a static component and an adaptive component so that samples are presented from easy to difficult. Experiments demonstrate that our resource-efficient dual-mask training framework is quantitatively and qualitatively superior to parameter-efficient tuning methods such as I2V-Adapter and SimDA, verifying the feasibility of our method for downstream tasks under constrained resources. Code will be made publicly available.
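The difficulty-adaptive curriculum described in the abstract can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the abstract only states that sample entropy is decomposed into a static and an adaptive (model-dependent) component and that samples are scheduled from easy to difficult, so the weighting `alpha`, the linear pacing schedule, and the function name `curriculum_select` are all assumptions introduced here.

```python
def curriculum_select(static_entropy, adaptive_entropy, step, total_steps,
                      alpha=0.5, start_frac=0.3):
    """Return indices of the samples admitted at this training step.

    static_entropy:   per-sample difficulty fixed before training
                      (e.g. precomputed from the data itself).
    adaptive_entropy: per-sample difficulty that changes with the model
                      (e.g. the current training loss on each sample).
    """
    # Difficulty as a weighted sum of the static and adaptive components,
    # mirroring the abstract's entropy decomposition (alpha is assumed).
    difficulty = [alpha * s + (1 - alpha) * a
                  for s, a in zip(static_entropy, adaptive_entropy)]

    # Pacing: the admitted fraction of samples grows linearly from
    # start_frac to 1.0 over the course of training (schedule assumed).
    progress = min(1.0, step / total_steps)
    frac = start_frac + (1.0 - start_frac) * progress
    k = max(1, round(frac * len(difficulty)))

    # Easy-to-difficult ordering: take the k lowest-difficulty samples.
    return sorted(range(len(difficulty)), key=difficulty.__getitem__)[:k]
```

Early in training only the lowest-difficulty samples are admitted; as training progresses the admitted set grows until the full dataset, including the hardest dual-masked samples, is used.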
PDF · March 25, 2025