In-Context LoRA for Diffusion Transformers

Abstract

Recent research (arXiv:2410.15027) has explored the use of diffusion transformers (DiTs) for task-agnostic image generation by simply concatenating attention tokens across images. However, despite substantial computational resources, the fidelity of the generated images remains suboptimal. In this study, we reevaluate and streamline this framework by hypothesizing that text-to-image DiTs inherently possess in-context generation capabilities, requiring only minimal tuning to activate them. Through diverse task experiments, we qualitatively demonstrate that existing text-to-image DiTs can effectively perform in-context generation without any tuning. Building on this insight, we propose a remarkably simple pipeline to leverage the in-context abilities of DiTs: (1) concatenate images instead of tokens, (2) perform joint captioning of multiple images, and (3) apply task-specific LoRA tuning using small datasets (e.g., 20 to 100 samples) instead of full-parameter tuning with large datasets. We name our models In-Context LoRA (IC-LoRA). This approach requires no modifications to the original DiT models, only changes to the training data. Remarkably, our pipeline generates high-fidelity image sets that better adhere to prompts. While task-specific in terms of tuning data, our framework remains task-agnostic in architecture and pipeline, offering a powerful tool for the community and providing valuable insights for further research on product-level task-agnostic generation systems. We release our code, data, and models at https://github.com/ali-vilab/In-Context-LoRA
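The data-preparation side of the pipeline, steps (1) and (2), can be illustrated with a minimal sketch. It is not the authors' code: images are modeled as plain nested lists of grayscale values rather than real image tensors, and the function names `concat_images` and `joint_caption`, as well as the `[IMAGE1]`-style panel markers, are assumptions made here for illustration. Step (3), the LoRA tuning itself, is not shown.

```python
from itertools import chain
from typing import List, Sequence

# Toy stand-in for an image: a list of rows, each row a list of grayscale values.
Image = List[List[int]]


def concat_images(images: Sequence[Image]) -> Image:
    """Step 1 (sketch): place several images side by side as one composite."""
    if not images:
        raise ValueError("need at least one image")
    height = len(images[0])
    if any(len(img) != height for img in images):
        raise ValueError("all images must share the same height")
    # Row i of the composite is row i of every panel, joined left to right.
    return [list(chain.from_iterable(img[i] for img in images))
            for i in range(height)]


def joint_caption(overall: str, panel_captions: Sequence[str]) -> str:
    """Step 2 (sketch): merge per-image captions into one prompt.

    The "[IMAGE<n>]" marker format is a hypothetical convention.
    """
    parts = [f"[IMAGE{i}] {caption.strip()}"
             for i, caption in enumerate(panel_captions, start=1)]
    return " ".join([overall.strip(), *parts])


# Build one training sample: a composite image paired with a joint caption.
left = [[0, 0], [0, 0]]
right = [[255, 255], [255, 255]]
composite = concat_images([left, right])
prompt = joint_caption("A two-panel image of the same object.",
                       ["front view.", "side view."])
```

Because the composite is an ordinary image with an ordinary caption, a set of such pairs can be passed to a standard text-to-image LoRA training setup without changing the model, which is the property the abstract emphasizes.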
