

Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models

February 6, 2026
Authors: Haoyu Zhang, Zhipeng Li, Yiwen Guo, Tianshu Yu
cs.AI

Abstract

Omni-modal large language models (OLLMs) aim to unify multimodal understanding and generation, yet incorporating speech with 3D facial animation remains largely unexplored despite its importance for natural interaction. A key challenge arises from the representation mismatch between the discrete, token-level semantic reasoning of LLMs and the dense, fine-grained temporal dynamics required for 3D facial motion, which makes direct modeling difficult to optimize under limited data. We propose Expressive Omni (Ex-Omni), an open-source omni-modal framework that augments OLLMs with speech-accompanied 3D facial animation. Ex-Omni reduces learning difficulty by decoupling semantic reasoning from temporal generation, leveraging speech units as temporal scaffolding and employing a unified token-as-query gated fusion (TQGF) mechanism for controlled semantic injection. We further introduce InstructEx, a dataset designed to facilitate research on augmenting OLLMs with speech-accompanied 3D facial animation. Extensive experiments demonstrate that Ex-Omni performs competitively against existing open-source OLLMs while enabling stable, aligned speech and facial animation generation.
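The abstract does not detail the TQGF architecture, but "token-as-query gated fusion" suggests a gated cross-attention block in which dense temporal tokens (e.g., speech-unit-aligned motion frames) act as queries over the LLM's discrete semantic token states, with a learned gate throttling how much semantics is injected. Below is a minimal sketch under that assumption; all module names, dimensions, and the tanh-gated residual are illustrative, not taken from the paper.

```python
# Hypothetical sketch of a token-as-query gated fusion block (not the
# authors' implementation): temporal query tokens cross-attend to the
# OLLM's semantic token states, and a learned scalar gate controls the
# amount of semantic injection into the temporal stream.
import torch
import torch.nn as nn


class TokenAsQueryGatedFusion(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        # Gate initialized to zero so training starts near the identity map,
        # letting semantic injection ramp up gradually.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, temporal_tokens: torch.Tensor,
                semantic_tokens: torch.Tensor) -> torch.Tensor:
        # temporal_tokens: (B, T, dim) dense motion/speech-unit queries
        # semantic_tokens: (B, S, dim) discrete LLM token states (keys/values)
        q = self.norm_q(temporal_tokens)
        kv = self.norm_kv(semantic_tokens)
        fused, _ = self.cross_attn(q, kv, kv)
        # Gated residual: controlled semantic injection into the temporal stream.
        return temporal_tokens + torch.tanh(self.gate) * fused


# Example: 50 temporal frames querying 32 semantic tokens.
block = TokenAsQueryGatedFusion(dim=512)
out = block(torch.randn(2, 50, 512), torch.randn(2, 32, 512))
print(out.shape)  # torch.Size([2, 50, 512])
```

A near-zero initial gate is a common choice for this kind of fusion: it keeps the temporal generator's behavior stable early in training while the cross-attention pathway learns useful semantic alignments.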