UniF^2ace: Fine-grained Face Understanding and Generation with Unified Multimodal Models
March 11, 2025
Authors: Junzhe Li, Xuerui Qiu, Linrui Xu, Liya Guo, Delin Qu, Tingting Long, Chun Fan, Ming Li
cs.AI
Abstract
Unified multimodal models (UMMs) have emerged as a powerful paradigm in
foundational computer vision research, demonstrating significant potential in
both image understanding and generation. However, existing research in the face
domain primarily focuses on coarse facial attribute understanding,
with limited capacity to handle fine-grained facial attributes and
without addressing generation capabilities. To overcome these limitations, we
propose UniF^2ace, the first UMM tailored specifically for
fine-grained face understanding and generation. In general, we train
UniF^2ace on a self-constructed, specialized dataset utilizing two
mutually beneficial diffusion techniques and a two-level mixture-of-experts
architecture. Specifically, we first build a large-scale facial dataset,
UniF^2ace-130K, which contains 130K image-text pairs with one
million question-answering pairs that span a wide range of facial attributes.
Second, we establish a theoretical connection between discrete diffusion score
matching and masked generative models, optimizing both evidence lower bounds
simultaneously, which significantly improves the model's ability to synthesize
facial details. Finally, we introduce both token-level and sequence-level
mixture-of-experts, enabling efficient fine-grained representation learning for
both understanding and generation tasks. Extensive experiments on
UniF^2ace-130K demonstrate that UniF^2ace outperforms
existing UMMs and generative models, achieving superior performance across both
understanding and generation tasks.
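The token-level mixture-of-experts mentioned above can be illustrated with a minimal routing sketch. This is not the paper's implementation: the expert count, top-k value, dimensions, and the use of plain linear maps as "experts" are all illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def token_level_moe(tokens, gate_w, experts, top_k=2):
    """Route each token to its top_k experts and mix their outputs.

    tokens:  (seq_len, d) token representations
    gate_w:  (d, n_experts) gating weights
    experts: list of n_experts callables, each mapping (d,) -> (d,)
    """
    scores = softmax(tokens @ gate_w)            # (seq_len, n_experts)
    out = np.zeros_like(tokens)
    for i, tok in enumerate(tokens):
        top = np.argsort(scores[i])[-top_k:]     # indices of the top_k experts
        weights = scores[i][top] / scores[i][top].sum()  # renormalize gate weights
        for w, e in zip(weights, top):
            out[i] += w * experts[e](tok)        # weighted mix of expert outputs
    return out


rng = np.random.default_rng(0)
d, n_experts, seq_len = 8, 4, 5
tokens = rng.standard_normal((seq_len, d))
gate_w = rng.standard_normal((d, n_experts))
# Each "expert" here is just a random linear map, standing in for an expert FFN.
expert_mats = [rng.standard_normal((d, d)) for _ in range(n_experts)]
experts = [lambda x, M=M: M @ x for M in expert_mats]
print(token_level_moe(tokens, gate_w, experts).shape)  # (5, 8)
```

A sequence-level variant would compute a single gate score per sequence (e.g. from a pooled representation) and route the whole sequence through the selected experts, rather than routing each token independently.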