CollectionLoRA: Collecting 50 Effects in One LoRA via Multi-Teacher On-Policy Distillation

Abstract

Customized image editing aims to equip pre-trained diffusion models with specific visual effects using limited paired data, typically via Low-Rank Adaptation (LoRA). As the number of desired effects grows, storing and dynamically loading many such effect LoRAs significantly increases deployment overhead. Furthermore, current pipelines typically cascade these effect LoRAs with acceleration modules for fast generation, which triggers severe parameter interference and results in concept bleeding and style degradation. We propose CollectionLoRA, a multi-teacher on-policy distillation framework that can distill the concepts of up to 50 different effect LoRAs, together with few-step generation capability, into a single LoRA. This fundamentally resolves the feature interference issue and significantly reduces deployment costs. Specifically, the method introduces (i) a Probabilistic Dual-Stream Routing mechanism that lets the model randomly switch between data sources during training, which improves its generalization to unseen scenarios; (ii) an Asymmetric Orthogonal Prompting strategy that achieves concept isolation within the prompt space; and (iii) a Coarse-to-Fine Distillation Objective that mitigates the distribution gap between the teacher and student models. Extensive evaluations show that CollectionLoRA distills all customized effects and few-step generation into a single LoRA, reducing deployment overhead while achieving concept fidelity comparable to or better than that of independently trained teacher models.
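The abstract states only that the routing mechanism randomly switches between data sources during training and that there are up to 50 effect teachers. The sketch below illustrates that idea in its simplest form and is not the authors' implementation. The names `route_step`, `simulate`, `stream_a`, `stream_b`, and `p_stream_a`, the uniform sampling of an effect per step, and the use of a single biased coin flip are all assumptions made for illustration; the abstract does not specify what the two streams contain or how the routing probability is set.

```python
import random
from collections import Counter


def route_step(rng, num_effects=50, p_stream_a=0.5):
    """Pick one effect (hence one teacher) and one data stream for a training step.

    Assumption: effects are sampled uniformly and the stream is chosen by a
    single biased coin flip. Both choices are hypothetical.
    """
    effect_id = rng.randrange(num_effects)  # index of the teacher LoRA for this step
    stream = "stream_a" if rng.random() < p_stream_a else "stream_b"
    return effect_id, stream


def simulate(num_steps=10_000, num_effects=50, p_stream_a=0.5, seed=0):
    """Run the router for num_steps and report how often each stream was used."""
    rng = random.Random(seed)  # seeded so the simulation is reproducible
    counts = Counter()
    effects_seen = set()
    for _ in range(num_steps):
        effect_id, stream = route_step(rng, num_effects, p_stream_a)
        counts[stream] += 1
        effects_seen.add(effect_id)
    return counts, effects_seen


if __name__ == "__main__":
    counts, effects_seen = simulate()
    print("steps per stream:", dict(counts))
    print("distinct effects visited:", len(effects_seen))
```

In a real training loop, the returned `effect_id` would select which teacher supervises the student on that step, and `stream` would select where the input for that step comes from. Setting `p_stream_a` to 0 or 1 disables the switching and can serve as a baseline when checking whether the random routing helps.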