InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue
October 15, 2025
Authors: Wenwen Tong, Hewei Guo, Dongchuan Ran, Jiangnan Chen, Jiefan Lu, Kaibin Wang, Keqiang Li, Xiaoxu Zhu, Jiakui Li, Kehan Li, Xueheng Li, Lumin Li, Chenxu Guo, Jiasheng Zhou, Jiandong Chen, Xianye Wu, Jiahao Wang, Silei Wu, Lei Chen, Hanming Deng, Yuxuan Song, Dinghao Zhou, Guiping Zhong, Ken Zheng, Shiyin Kang, Lewei Lu
cs.AI
Abstract
We introduce InteractiveOmni, a unified and open-source omni-modal large
language model for audio-visual multi-turn interaction, ranging from 4B to 8B
parameters, designed to lead the field of lightweight models by offering
comprehensive omni-modal understanding and speech generation capabilities. To
achieve this, we integrate the vision encoder, audio encoder, large language
model, and speech decoder into a unified model for understanding and generation
tasks. We design a multi-stage training strategy to ensure robust cross-modal
capabilities, including pre-training for omni-modal understanding, followed by
post-training with speech conversation and audio-visual interaction. To enable
human-like long-term conversational ability, we meticulously curate a
multi-turn training dataset that enhances the model's ability to handle complex,
multi-turn interactions. To effectively evaluate multi-turn memory and
speech interaction capabilities, we construct a multi-modal multi-turn memory
benchmark and a multi-turn speech interaction benchmark. Experiments
demonstrate that InteractiveOmni significantly outperforms leading open-source
models and provides a more intelligent multi-turn audio-visual experience,
particularly in its long-term memory capabilities. Notably, InteractiveOmni-4B
is comparable to much larger models such as Qwen2.5-Omni-7B on general
benchmarks, retaining 97% of the performance of InteractiveOmni-8B
at only 50% of the model size. Achieving state-of-the-art results
against similarly sized models across image, audio, video understanding, and
speech generation tasks, InteractiveOmni is an accessible, open-source
foundation for next-generation intelligent interactive systems.
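
To make the described composition concrete, the sketch below wires the four components the abstract names (vision encoder, audio encoder, LLM, speech decoder) into one model whose shared backbone drives both text and speech outputs. This is a minimal illustration under assumed shapes: the class `UnifiedOmniModel`, the placeholder `nn.Linear` encoders, and all dimensions are hypothetical stand-ins, not the authors' implementation.

```python
# Hypothetical sketch of a unified omni-modal model: modality encoders project
# vision and audio features into a shared token space, a single LLM backbone
# attends across the interleaved sequence, and two heads emit text logits and
# discrete speech-token logits. All names and sizes are illustrative.
import torch
import torch.nn as nn


class UnifiedOmniModel(nn.Module):
    def __init__(self, d_model: int = 1024):
        super().__init__()
        # Stand-ins for the real pretrained components.
        self.vision_encoder = nn.Linear(768, d_model)   # e.g. ViT features -> LLM space
        self.audio_encoder = nn.Linear(512, d_model)    # e.g. speech features -> LLM space
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.text_head = nn.Linear(d_model, 32000)      # text vocabulary logits
        self.speech_decoder = nn.Linear(d_model, 1024)  # speech-token logits

    def forward(self, vision_feats: torch.Tensor, audio_feats: torch.Tensor):
        # Concatenate modality tokens into one sequence so the shared
        # backbone can attend across modalities.
        tokens = torch.cat(
            [self.vision_encoder(vision_feats), self.audio_encoder(audio_feats)],
            dim=1,
        )
        hidden = self.llm(tokens)
        # The same hidden states feed both understanding and generation heads.
        return self.text_head(hidden), self.speech_decoder(hidden)


model = UnifiedOmniModel()
text_logits, speech_logits = model(torch.randn(1, 16, 768), torch.randn(1, 32, 512))
print(text_logits.shape, speech_logits.shape)  # (1, 48, 32000), (1, 48, 1024)
```

The design point this sketch captures is that a single backbone serves both understanding and generation: multi-turn memory lives in one shared context rather than being passed between separate ASR, LLM, and TTS systems.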