OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows

Abstract

We introduce OmniFlow, a novel generative model designed for any-to-any generation tasks such as text-to-image, text-to-audio, and audio-to-image synthesis. OmniFlow extends the rectified flow (RF) framework used in text-to-image models to handle the joint distribution of multiple modalities. It outperforms previous any-to-any models on a wide range of tasks, such as text-to-image and text-to-audio synthesis. Our work offers three key contributions. First, we extend RF to a multi-modal setting and introduce a novel guidance mechanism that lets users flexibly control the alignment between different modalities in the generated outputs. Second, we propose a novel architecture that extends the text-to-image MMDiT architecture of Stable Diffusion 3 to enable audio and text generation. The extended modules can be efficiently pretrained individually and then merged with the vanilla text-to-image MMDiT for fine-tuning. Lastly, we conduct a comprehensive study of the design choices of rectified flow transformers for large-scale audio and text generation, providing insights into optimizing performance across diverse modalities. The code will be available at https://github.com/jacklishufan/OmniFlows.
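For readers unfamiliar with rectified flow, the sketch below illustrates the standard single-modality formulation that the abstract builds on: a straight-line path between a noise sample and a data sample, a constant velocity as the regression target, and Euler integration for sampling. It is not OmniFlow's implementation. The function names, the convention that `t = 0` is noise and `t = 1` is data, and the plain-list vectors are all assumptions made for illustration; the multi-modal extension and the guidance mechanism are not shown.

```python
# Minimal sketch of standard rectified flow (assumed convention: t=0 noise, t=1 data).
# All names here are hypothetical and chosen for illustration only.

def rf_interpolate(x0, x1, t):
    """Point on the straight path from x0 (noise) to x1 (data) at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def rf_velocity_target(x0, x1):
    """Constant velocity along the straight path; the model regresses onto this."""
    return [b - a for a, b in zip(x0, x1)]

def rf_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at (x_t, t)."""
    x_t = rf_interpolate(x0, x1, t)
    target = rf_velocity_target(x0, x1)
    pred = predict_velocity(x_t, t)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

def euler_sample(predict_velocity, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [a + dt * b for a, b in zip(x, v)]
    return x
```

In practice `predict_velocity` would be a neural network (a transformer in the setting the abstract describes); here any callable taking `(x_t, t)` and returning a list of the same length works, which keeps the sketch easy to check by hand.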
