JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching

Abstract

The intrinsic link between facial motion and speech is often overlooked in generative modeling, where talking head synthesis and text-to-speech (TTS) are typically addressed as separate tasks. This paper introduces JAM-Flow, a unified framework that simultaneously synthesizes and conditions on both facial motion and speech. Our approach leverages flow matching and a novel Multi-Modal Diffusion Transformer (MM-DiT) architecture that integrates specialized Motion-DiT and Audio-DiT modules. These modules are coupled via selective joint attention layers and incorporate key architectural choices, such as temporally aligned positional embeddings and localized joint attention masking, to enable effective cross-modal interaction while preserving modality-specific strengths. Trained with an inpainting-style objective, JAM-Flow supports a wide array of conditioning inputs, including text, reference audio, and reference motion. This enables tasks such as synchronized talking head generation from text and audio-driven animation, among others, within a single, coherent model. JAM-Flow significantly advances multi-modal generative modeling by providing a practical solution for holistic audio-visual synthesis. Project page: https://joonghyuk.com/jamflow-web
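
To make the idea of localized joint attention masking over temporally aligned positions more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `local_joint_mask`, the token rates, and the window size are all hypothetical choices made for illustration. The sketch assumes each motion and audio token can be given a timestamp in seconds, and it lets a motion token attend only to audio tokens whose timestamps fall within a fixed window around its own.

```python
def local_joint_mask(n_motion, n_audio, motion_rate, audio_rate, window_sec):
    """Build a boolean mask for cross-modal attention (illustrative only).

    mask[i][j] is True when motion token i may attend to audio token j.
    All parameters are hypothetical:
      motion_rate, audio_rate: tokens per second for each modality
      window_sec: half-width of the allowed temporal neighbourhood
    """
    mask = []
    for i in range(n_motion):
        # Place the motion token on a shared time axis (seconds).
        t_motion = i / motion_rate
        row = []
        for j in range(n_audio):
            # Place the audio token on the same time axis.
            t_audio = j / audio_rate
            # Allow attention only between temporally close tokens.
            row.append(abs(t_motion - t_audio) <= window_sec)
        mask.append(row)
    return mask


# Example with made-up rates: 2 motion tokens/s, 4 audio tokens/s.
mask = local_joint_mask(
    n_motion=4, n_audio=8, motion_rate=2.0, audio_rate=4.0, window_sec=0.25
)
for row in mask:
    # Print each row as a string of 1s (allowed) and 0s (blocked).
    print("".join("1" if allowed else "0" for allowed in row))
```

In a transformer, a mask like this would typically be applied to the attention scores before the softmax, so that blocked pairs contribute nothing. How JAM-Flow actually defines its window and positional alignment is not specified in the abstract.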
