

InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue

October 15, 2025
Authors: Wenwen Tong, Hewei Guo, Dongchuan Ran, Jiangnan Chen, Jiefan Lu, Kaibin Wang, Keqiang Li, Xiaoxu Zhu, Jiakui Li, Kehan Li, Xueheng Li, Lumin Li, Chenxu Guo, Jiasheng Zhou, Jiandong Chen, Xianye Wu, Jiahao Wang, Silei Wu, Lei Chen, Hanming Deng, Yuxuan Song, Dinghao Zhou, Guiping Zhong, Ken Zheng, Shiyin Kang, Lewei Lu
cs.AI

Abstract

We introduce InteractiveOmni, a unified and open-source omni-modal large language model for audio-visual multi-turn interaction, available at 4B and 8B parameter scales and designed to lead the field of lightweight models by offering comprehensive omni-modal understanding and speech generation capabilities. To achieve this, we integrate a vision encoder, an audio encoder, a large language model, and a speech decoder into a unified model for both understanding and generation tasks. We design a multi-stage training strategy to ensure robust cross-modal capabilities: pre-training for omni-modal understanding, followed by post-training for speech conversation and audio-visual interaction. To enable human-like long-term conversational ability, we meticulously curate a multi-turn training dataset that enhances the model's ability to handle complex multi-turn interactions. To effectively evaluate multi-turn memory and speech interaction capabilities, we construct a multi-modal multi-turn memory benchmark and a multi-turn speech interaction benchmark. Experiments demonstrate that InteractiveOmni significantly outperforms leading open-source models and provides a more intelligent multi-turn audio-visual experience, particularly through its long-term memory capabilities. Notably, InteractiveOmni-4B is comparable to much larger models such as Qwen2.5-Omni-7B on general benchmarks, and it retains 97% of the performance of InteractiveOmni-8B at only 50% of the model size. Achieving state-of-the-art results against similarly sized models across image, audio, and video understanding and speech generation tasks, InteractiveOmni serves as an accessible, open-source foundation for next-generation intelligent interactive systems.
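As a rough illustration of the unified architecture the abstract describes (vision encoder, audio encoder, LLM backbone, and speech decoder combined in one model), here is a minimal PyTorch sketch. Every module choice, dimension, and name below is an illustrative assumption rather than the released InteractiveOmni implementation: the encoders are stand-in linear projections, and the speech decoder is reduced to a codec-token prediction head.

```python
# Hypothetical sketch of a unified omni-modal model: modality encoders feed a
# shared LLM backbone whose hidden states drive both text and speech outputs.
# Shapes and submodules are toy stand-ins, not the paper's actual components.
import torch
import torch.nn as nn

class UnifiedOmniModel(nn.Module):
    def __init__(self, d_model=512, vocab_size=32000, audio_codec_size=1024):
        super().__init__()
        # Modality encoders project raw features into the LLM embedding space.
        self.vision_encoder = nn.Linear(768, d_model)  # stand-in for a ViT-style encoder
        self.audio_encoder = nn.Linear(128, d_model)   # stand-in for an audio encoder
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Shared backbone over the interleaved multi-modal token sequence.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)
        # Two heads on the same hidden states: text tokens for understanding,
        # codec tokens standing in for the speech decoder's generation targets.
        self.text_head = nn.Linear(d_model, vocab_size)
        self.speech_head = nn.Linear(d_model, audio_codec_size)

    def forward(self, text_ids, vision_feats, audio_feats):
        # Interleave vision, audio, and text embeddings into one sequence.
        seq = torch.cat([
            self.vision_encoder(vision_feats),
            self.audio_encoder(audio_feats),
            self.text_embed(text_ids),
        ], dim=1)
        hidden = self.llm(seq)
        return self.text_head(hidden), self.speech_head(hidden)

model = UnifiedOmniModel()
text_logits, speech_logits = model(
    torch.randint(0, 32000, (1, 16)),  # dummy text token ids
    torch.randn(1, 8, 768),            # dummy vision patch features
    torch.randn(1, 8, 128),            # dummy audio frame features
)
print(text_logits.shape, speech_logits.shape)
```

The design point mirrored here is that a single backbone consumes interleaved vision, audio, and text embeddings and emits both text and speech-codec predictions, so understanding and speech generation share one set of hidden states, which is what makes multi-turn audio-visual dialogue possible within one model.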