UniF^2ace: Fine-grained Face Understanding and Generation with Unified Multimodal Models

Abstract

Unified multimodal models (UMMs) have emerged as a powerful paradigm in foundational computer vision research, showing significant potential in both image understanding and generation. However, existing research in the face domain focuses primarily on coarse facial attribute understanding: it has limited capacity to handle fine-grained facial attributes and does not address generation. To overcome these limitations, we propose UniF^2ace, the first UMM tailored specifically for fine-grained face understanding and generation. We train UniF^2ace on a self-constructed, specialized dataset using two mutually beneficial diffusion techniques and a two-level mixture-of-experts architecture. Specifically, we first build a large-scale facial dataset, UniF^2ace-130K, which contains 130K image-text pairs and one million question-answer pairs spanning a wide range of facial attributes. Second, we establish a theoretical connection between discrete diffusion score matching and masked generative models and optimize both evidence lower bounds simultaneously, which significantly improves the model's ability to synthesize facial details. Finally, we introduce both token-level and sequence-level mixture-of-experts, enabling efficient fine-grained representation learning for both understanding and generation tasks. Extensive experiments on UniF^2ace-130K demonstrate that UniF^2ace outperforms existing UMMs and generative models on both understanding and generation tasks.
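To make the distinction between token-level and sequence-level mixture-of-experts concrete, the following is a minimal illustrative sketch in plain Python. It is not the UniF^2ace architecture: the top-1 routing rule, the mean-pooling used for the sequence-level gate, the linear experts, and all names (`token_level_moe`, `sequence_level_moe`, `gate`, `experts`) are assumptions chosen only to show where the routing decision is made in each case.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def make_matrix(rows, cols, rng):
    """Random matrix with entries in [-1, 1] (toy stand-in for learned weights)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def token_level_moe(tokens, gate, experts):
    """Assumed top-1 routing: each token independently picks its own expert."""
    outputs, choices = [], []
    for tok in tokens:
        probs = softmax(matvec(gate, tok))
        k = max(range(len(probs)), key=probs.__getitem__)
        # Scale the chosen expert's output by its gate probability.
        outputs.append([probs[k] * y for y in matvec(experts[k], tok)])
        choices.append(k)
    return outputs, choices


def sequence_level_moe(tokens, gate, experts):
    """Assumed top-1 routing: one expert is picked for the whole sequence
    from its mean-pooled representation."""
    dim = len(tokens[0])
    pooled = [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]
    probs = softmax(matvec(gate, pooled))
    k = max(range(len(probs)), key=probs.__getitem__)
    outputs = [[probs[k] * y for y in matvec(experts[k], tok)] for tok in tokens]
    return outputs, k


# Toy sizes, hypothetical and unrelated to the paper's configuration.
DIM, NUM_EXPERTS, SEQ_LEN = 4, 3, 5
rng = random.Random(0)
tokens = make_matrix(SEQ_LEN, DIM, rng)
gate = make_matrix(NUM_EXPERTS, DIM, rng)
experts = [make_matrix(DIM, DIM, rng) for _ in range(NUM_EXPERTS)]

tok_out, tok_choices = token_level_moe(tokens, gate, experts)
seq_out, seq_choice = sequence_level_moe(tokens, gate, experts)
```

In this sketch, `tok_choices` holds one expert index per token, while `seq_choice` is a single index shared by every token in the sequence. That is the only difference between the two levels here; how the two levels are combined in the actual model is not specified in the abstract.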
