Generative AI Beyond LLMs: System Implications of Multi-Modal Generation

Abstract

As the development of large-scale Generative AI models evolves beyond text (1D) generation to include image (2D) and video (3D) generation, processing spatial and temporal information presents unique challenges to quality, performance, and efficiency. We present the first work towards understanding this new system design space for multi-modal text-to-image (TTI) and text-to-video (TTV) generation models. Current model architecture designs fall into two categories: Diffusion-based and Transformer-based models. Our systematic performance characterization on a suite of eight representative TTI/TTV models shows that, after state-of-the-art optimization techniques such as Flash Attention are applied, Convolution accounts for up to 44% of execution time for Diffusion-based TTI models, while Linear layers consume up to 49% of execution time for Transformer-based models. We additionally observe that Diffusion-based TTI models resemble the Prefill stage of LLM inference, and that their speedup from Flash Attention is 1.1-2.5x greater than that of Transformer-based TTI models, which resemble the Decode stage. Since optimizations designed for LLMs do not map directly onto TTI/TTV models, we must conduct a thorough characterization of these workloads to gain insights into new optimization opportunities. In doing so, we define sequence length in the context of TTI/TTV models and observe that sequence length can vary by up to 4x in Diffusion model inference. We additionally observe that the temporal aspects of TTV workloads pose unique system bottlenecks, with Temporal Attention accounting for over 60% of total Attention time. Overall, our in-depth system performance characterization is a critical first step towards designing efficient and deployable systems for emerging TTI/TTV workloads.
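To make the idea of sequence length in an image model concrete, the following is a minimal sketch, not the definition used in the paper. It assumes that the self-attention sequence length at a given stage equals the number of spatial positions in the feature map (height times width), and that each stage of a UNet-style Diffusion model halves both dimensions. The function names, the 64x64 latent size, and the downsampling factors are all hypothetical choices made for illustration.

```python
def attention_seq_len(height, width, downsample_factor):
    """Spatial positions attended over at one stage.

    Assumption: sequence length = (H / f) * (W / f), where f is the
    cumulative downsampling factor of the stage.
    """
    return (height // downsample_factor) * (width // downsample_factor)


def seq_len_profile(latent_h, latent_w, factors=(1, 2, 4, 8)):
    """Sequence length at each stage; the factors are hypothetical."""
    return [attention_seq_len(latent_h, latent_w, f) for f in factors]


# Hypothetical 64x64 latent feature map.
profile = seq_len_profile(64, 64)

# Ratio between the sequence lengths of adjacent stages.
ratios = [a / b for a, b in zip(profile, profile[1:])]

print(profile)
print(ratios)
```

Under these assumptions, halving both spatial dimensions changes the sequence length by a factor of four between adjacent stages, which illustrates why attention cost in such a model is not constant across a single inference pass, unlike a fixed-length LLM prompt.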
